\(^1\)Department of General Psychology, University of Padova, Italy
\(^2\)Department of Philosophy, Sociology, Pedagogy and Applied Psychology, University of Padova, Italy
\(^3\)Department of Communication Sciences, Humanities and International Studies, University of Urbino Carlo Bo, Italy
The present document includes the data pre-processing steps used to read the different types of data collected over three days from a sample of 105 healthy adults:
PrelQS = demographic & retrospective self-report
data collected with the preliminary questionnaire (Google
Forms)
ESM = experience sampling measures collected with
the Sensus
Mobile app (Xiong et al., 2016)
HRV = 2-min segments of blood volume pulse data
recorded with the E4 wristband (Empatica, Milan)
GNG = Go/No-Go behavioral data recorded with E-Prime
2.0.10 (Psychology Software Tools, Inc., Sharpsburg, PA)
For each data type, the following pre-processing steps are
implemented to generate the wide and long
datasets to be used for subsequent data analysis (see analytical reports
1. Psychometrics
and descriptives and 2. Regression
models):
Raw data files reading & merging
Raw data recoding & processing
Data filtering based on inclusion criteria and data quality
Here, we remove all objects from the R global environment, and we set the system time zone for better temporal synchronization across data files.
# removing all objets from the workspace
rm(list=ls())
# setting system time zone to GMT (for consistent temporal synchronization)
Sys.setenv(tz="GMT")
The following R packages are used in this document (see References section):
# required packages
packages <- c("ggplot2","gridExtra","jsonlite","plyr","lubridate","scales","knitr","reshape2","careless",
"shiny","pracma","Rmisc","lattice")
# generate packages references
knitr::write_bib(c(.packages(), packages),"packagesProc.bib")
# # run to install missing packages
# xfun::pkg_attach2(packages, message = FALSE); rm(list=ls())
Here, the retrospective self-report data collected with the
preliminary questionnaire are read and saved in the
PrelQS dataset, recoded, and filtered based on the
inclusion criteria.
library(ggplot2);library(gridExtra) # loading required packages
First, we read the Preliminary_qs.csv data file exported
from Google
Forms.
PrelQS <- read.csv("qs preliminare/Preliminary_qs.csv")
Second, we recode the PrelQS variables to be used for
the analyses.
As a first step, we recode participants’ identification
codes ID (currently corresponding to their e-mail
addresses) using the “HRVXXX” format (e.g., from ‘john.smith@gmail.com’ to ‘HRV001’). This is done with
the prel.qs_IDrecode() function. Since the participants’
e-mail addresses are confidential, both the function and the original
dataset are not showed.
# renaming the first two variables
colnames(PrelQS)[1:2] <- c("timestamp","ID")
# loading and using the function to recode the ID variable
source("prel.qs_IDrecode.R")
(PrelQS <- prel.qs_IDrecode(PrelQS))[1:3,] # showing first three rows
Second, we rename the other variables, remove those not considered for the present study, and recode the considered variables to be used in the analyses.
# renaming all variables
colnames(PrelQS) <- c("time","ID", # submission timestamp & participants ID
"consent", # consent to participate (all 'yes')
"age","gender","weight","height","itaMT", # demographics
# inclusion criteria
"drugs","drugs.which","cvDysf","cvDysf.which",
"otherDysf","otherDysf.which","phone","phone.which",
# retrospective scales (* = not considered in this study)
paste("DASS21",1:21,sep=""), # Depression, Anxiety, and Stress Scale*
paste("PANAS",1:20,sep=""), # Positive and Negative Affective Schedule*
paste("DERS",1:36,sep=""), # Difficulties in Emotion Regulation Scale*
paste("BIS11",c(1:15,17:20,22,23,25:30),sep=""), # Barratt Impulsiveness Scale-11 (3 items out)
paste("PSI",1:17,sep="")) # Physical Symptoms Inventory*
# keeping only considered variables (demographics and inclusion criteria)
PrelQS <- PrelQS[,c("ID","time",colnames(PrelQS)[which(colnames(PrelQS)=="age"):which(colnames(PrelQS)=="height")],
colnames(PrelQS)[which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")])]
# recoding variables
PrelQS[,c("ID","gender")] <- lapply(PrelQS[,c("ID","gender")],as.factor) # ID and gender as factor
PrelQS$time <- as.POSIXct(PrelQS$time,format="%Y/%m/%d %I:%M:%S %p") # time as POSIXct
PrelQS[PrelQS$height<3,"height"] <- PrelQS[PrelQS$height<3,"height"]*100 # correcting heights reported in meters
PrelQS <- cbind(PrelQS[,1:4],BMI=PrelQS$weight/(PrelQS$height/100)^2, # computing BMI, removing weight & height
PrelQS[,7:ncol(PrelQS)])
for(i in which(colnames(PrelQS)%in%c("drugs","cvDysf","otherDysf","phone"))){ PrelQS[,i] <- gsub("Sì","Yes",PrelQS[,i]) }
PrelQS[,which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")] <- # inclusion criteria as factors
lapply(PrelQS[,which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")],as.factor)
# sorting dataset by ID and time
PrelQS <- PrelQS[order(PrelQS$ID,PrelQS$time),]
Third, we clean and filter the data based on the inclusion criteria and by focusing on those participants that were actually invited to participate to the daily protocol (N = 105).
First, we inspect the original number of cases in the
PrelQS dataset. The questionnaire was completed by a total
of 164 respondents, of which a subsample was invited to participate in
the daily protocol.
cat("Original No. of responses to the PrelQS =",nrow(PrelQS))
## Original No. of responses to the PrelQS = 164
First, we inspect and remove cases of double
responses (N = 3 couples of responses with the same
ID value). In these cases, only the first response (i.e.,
the one with the earilest timestamp) is included.
# detecting double responses
cat(nrow(PrelQS[duplicated(PrelQS$ID),]),"cases of double responses (same ID)")
## 3 cases of double responses (same ID)
# removing double responses
memory <- PrelQS
PrelQS <- PrelQS[!duplicated(PrelQS$ID),]
cat("Removed",nrow(memory)-nrow(PrelQS),"double responses")
## Removed 3 double responses
Second, we take a look at the variables concerning the inclusion criteria of the study:
Not suffering from dysfunctions (e.g., anxiety disorder) or taking medications affecting the nervous system (e.g., antidepressants)
Not suffering from dysfunctions (e.g., hypertension) or taking medications affecting the cardiovascular system (e.g., beta blockers)
Owning an Android or iOS smartphone
In section 1.2.1., we recoded participants identification codes and
we marked the cases of participants not meeting the inclusion criteria,
as well as other cases of participants that were not invited to take
part in the EMA protocol for other reasons, using the
“OUT” label in the ID variable.
From a total of 161 responses to the preliminary questionnaire, we can
see that 105 participants (65%) were actually invited to participate in
the daily protocol.
cat("Total No. of responses to the PrelQS =",nrow(PrelQS),"\n -",
nrow(PrelQS[substr(PrelQS$ID,1,3)!="OUT",]),"invited\n -",nrow(PrelQS[substr(PrelQS$ID,1,3)=="OUT",]),"not invited")
## Total No. of responses to the PrelQS = 161
## - 105 invited
## - 56 not invited
Here, we better evaluate the reasons for the exclusion of the remaining 56 participants.
9 participants were not invited because they reported suffering from cardiovascular (i.e., premature heart beat, problems with the mitral valve) or other dysfunction (i.e., thalassemia, anxiety, Crohn’s disease) and/or taking antidepressants
1 participant was not invited because she did not own a smartphone
Further 46 participants were not invited to join the EMA protocol for other reasons (e.g., calendar incompatibilities, end of the data collection)
Here, we compare the demographic and retrospective variables between
included and excluded participants. Most participants that were not
invited to take part in the daily protocol were female
(probably due to the higher No. of female compared to undergraduates
participating to psychological studies, and our goal of having a
balanced sample), with no marked differences in terms of
gender, age, BMI
# preparing data
PrelQS$excl <- "IN" # creating excl variable (factor)
PrelQS[substr(PrelQS$ID,1,3)=="OUT","excl"] <- "OUT"
PrelQS$excl <- as.factor(PrelQS$excl)
# plotting
grid.arrange(ggplot(PrelQS,aes(x=excl,fill=gender)) + geom_bar(),
ggplot(PrelQS,aes(x=excl,y=age,fill=excl)) + geom_violin(),
ggplot(PrelQS,aes(x=excl,y=BMI,fill=excl)) + geom_violin(),nrow=1)
# removing excl
PrelQS$excl <- NULL
Here, we show the No. of participants not invited to take part in the
EMA protocol due to dysfunctions affecting either the
cardiovascular or the nervous system. We can see that
4 females were excluded since they reported suffering from a
cardiovascular dysfunction including premature heart
beat (027, 055) and problems with the mitral
valve (033, 053). 3 further females (4.61%)
were excluded since they reported suffering from other dysfunctions
affecting the nervous system, including thalassemia
(009), anxiety (020), Crohn’s disease
(049).
# inclusion criteria variables
ic <- colnames(PrelQS)[which(colnames(PrelQS)=="drugs"):which(colnames(PrelQS)=="phone.which")]
# cardiovascular dysfunctions
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$cvDysf=="Yes", c("ID","gender",ic)]
# other dysfunctions
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$otherDysf=="Yes", c("ID","gender",ic)]
Here, we show the No. of participants not invited to take part in the
EMA protocol due to medications affecting either the
cardiovascular or the nervous system. 3 participants
(017, F; 029, M; 033, M) were
excluded since they took antidepressants (Laroxyl,
Venlafaxina, Remeron), one of which also suffered from a cardiovascular
dysfunction.
# cardiovascular dysfunctions
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$drugs=="Yes", c("ID","gender",ic)]
Here, we show the No. of participants not invited to take part in the
EMA protocol due to medications affecting either the
cardiovascular or the nervous system. Only 1 female
(051) was excluded since she reported to not own a personal
smarpthone.
# cardiovascular dysfunctions
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & PrelQS$phone=="No", c("ID","gender",ic)]
The further 46 participants marked with "OUT" were not
invited due to other reasons (e.g., calendar incompatibility, end of the
study). Here, we inspect the data of such participants and we
remove them from the dataset. We can see that these
participants were mainly females (probably due to the higher No. of
female compared to undergraduates participating to psychological
studies, and our goal of having a balanced sample), with no other
peculiarities compared to the rest of the sample.
# showing participants that were not invited for other reasons
PrelQS[substr(PrelQS$ID,1,3)=="OUT" & !(PrelQS$ID%in%c("OUT033","OUT053","OUT027","OUT055","OUT009","OUT020",
"OUT049","OUT017","OUT029","OUT051")),]
Third, we remove all participants marked as “OUT” from the
Prelqs dataset.
new <- PrelQS[!substr(PrelQS$ID,1,3)=="OUT",]
cat("Filtered",nrow(PrelQS)-nrow(new),"cases; new number of cases =",nrow(new))
## Filtered 56 cases; new number of cases = 105
PrelQS <- new
Fourth, we flag further participants that are not excluded but whose inclusion might bias some of the results.
First, some of the included participants reported suffering from dysfunctions or taking medications that were considered irrelevant for the current study. Here, we better inspect those conditions (N = 9 and 13, respectively):
6 included participants (4 females, 2 males) reported suffering from cardiovascular dysfunctions: arrhythmia, tachycardia episodes, and bicuspid aortic valve
3 included participants (2 females, 1 male) reported suffering from other dysfunctions: hypothyroidism, asthma, and allergy
6 females reported taking hormonal contraceptives
8 participants took other medications including antihistamines (2 males, 5 females), and thyroid hormones (EUTIROX, 1 female)
# cardiovascular dysfunctions
PrelQS[PrelQS$cvDysf=="Yes", c("ID","gender",ic)]
# other dysfunctions
PrelQS[PrelQS$otherDysf=="Yes", c("ID","gender",ic)]
# drugs
PrelQS[PrelQS$drugs=="Yes", c("ID","gender",ic)]
Although we considered such conditions as potentially irrelevant for
the present study, some of them (especially cardiovascular and other
dysfunctions) might play a role. Thus, we create the dysfun
variable to flag cases of included participants with
cardiovascular and/or other dysfunctions.
# recoding dysfun variable (accounting for both cardiovascular and other dysfunctions)
PrelQS$dysfun <- 0
PrelQS[PrelQS$cvDysf=="Yes" | PrelQS$otherDysf=="Yes","dysfun"] <- 1
# recoding drugs variable
PrelQS$drugs <- gsub("Yes","1",gsub("No","0",PrelQS$drugs))
# removing unnecessary variables
PrelQS[,c("cvDysf","otherDysf","cvDysf.which","otherDysf.which","drugs.which","phone","phone.which")] <- NULL
# summary of dysfun and drugs
PrelQS[,c("dysfun","drugs")] <- lapply(PrelQS[,c("dysfun","drugs")],as.factor) # both as factor
summary(PrelQS[substr(PrelQS$ID,1,3)=="HRV",c("dysfun","drugs")])
## dysfun drugs
## 0:96 0:92
## 1: 9 1:13
Second, four of the included participants (4.1%) lost one protocol
day (i.e., day 3 for HRV013, HRV021, and
HRV091, day 1 for HRV021), which was postponed
to a following week. Here, we mark these participants with
postponed = 1.
PrelQS$postponed <- 0
PrelQS[PrelQS$ID %in% c("HRV013","HRV021","HRV067","HRV091"),"postponed"] <- 1
Finally, 8 further participants were excluded due to
dropping-out during the EMA protocol (009, F;
059, M), poor quality of physiological data
(015, F; 046, F; 066, M) or
technical problems with the mobile app (011, M;
027, M; 029, M). Here, we mark these
participants as ‘OUT’.
# marking 8 excluded participants as "OUT"
memory <- PrelQS
PrelQS$ID <- as.character(PrelQS$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
PrelQS[PrelQS$ID%in%excl,"ID"] <- gsub("HRV","OUT",PrelQS[PrelQS$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"
PrelQS$ID <- as.factor(PrelQS$ID)
PrelQS <- PrelQS[order(PrelQS$ID),]
# updating number of included participants
cat(nrow(PrelQS[substr(PrelQS$ID,1,3)=="HRV",]),"included participants out of",
nrow(memory[substr(memory$ID,1,3)=="HRV",]),"invited participants")
## 97 included participants out of 105 invited participants
Here, the raw data collected with the experience sampling
method (ESM) are read and saved in the ESM
dataset, recoded, and filtered based on response rate and inclusion
criteria.
First, we read the JSON data files exported from the Sensus Mobile app (Xiong et al., 2016).
Specifically, the Protocols
created with Sensus were configured to store the recorded .JSON data
files in our private AWS S3
bucket, from which they were downloaded and stored in the
ESM/data folder. From the Sensus app, we also exported the
Probe
Definition files, which we use to replace the input IDs
(i.e., by default, each item is associated with an alphanumeic code)
with the corresponding input names (i.e., the item labels that
we set in the protocols). These files were saved in the
ESM/probe folder.
The readSurveyData function is used to optimize the ESM
data reading.
readSurveyData
#' @title Read JSON data exported from the Sensus mobile app
#' @param data.path = character string indicating the full path to the folder where the JSON raw data are stored.
#' @param probe.definition = character string indicating the full path to the folder where Probe Definition JSON file(s) (from the Sensus mobile app: Protocol > Probe > Scripted Interaction > Share definition).
#' @param messages = logical value indicating whether the processing steps should be printed (default: TRUE).
readSurveyData <- function(data.path,probe.definition,messages=TRUE){ require(jsonlite)
# 1. Reading data
# .......................................
options(digits.secs=3) # setting No. of digits
paths = list.files(data.path,recursive=TRUE,full.names=TRUE,include.dirs=FALSE) # listing files in data path
if(messages==TRUE){ cat("\n\nReading",length(paths),"data files from",data.path,"...")}
var.names <- c("ParticipantId","Timestamp","InputId","Response","RunTimestamp","SubmissionTimestamp",
"ScriptName","ProtocolId","$type") # selecting variables of interest
data <- as.data.frame(matrix(nrow=0,ncol=9)) # empty dataframe creation and population
colnames(data) <- var.names
for(path in paths){
if(file.info(path)$size>0){ # only reading Datum files (i.e., containing ScriptDatum, which is sized > 0 Kb)
new.data <- read_json(path,simplifyDataFrame=TRUE)
if(class(new.data)=="data.frame" & !is.null(new.data$Response)){ # only keeping files with Response data
if(class(new.data$Response)=="data.frame"){ # sometimes the Response column is read as dataframe
new.data$Response <- as.character(new.data$Response$`$values`)}
data <- rbind(data,new.data[var.names]) }}}
data <- data[!is.na(data$ParticipantId),] # only keeping data with ParticipantId information (i.e., participant identification)
data <- data[!is.na(data$InputId),] # only keeping data with InputId information (i.e., item identification)
names(data)[9] <- "os" # $type as OS (android or iOS) # mobile OS information
data[,9] <- gsub("Sensus.Probes.User.Scripts.ScriptDatum, Sensus","",data[,9]) # removing unuseful information
# 2. From Response Ids to Item labels (based on Probe Definition)
# ...............................................................
if(!is.na(probe.definition)){ if(messages==TRUE){ cat("\n\nConverting ResponseId into InputId based on Probe Definition...")}
# function to read Probe Definition files
readProbe <- function(path){ probedefinition <- read_json(path,simplifyDataFrame=TRUE) # first probe definition file
# reading input labels of the first inputGroup
inputs <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[1]]$Inputs$`$values`[[1]]$Name
infos <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[1]] # input labels
PROTOCOL <- data.frame(protocolName=probedefinition$Protocol$Name,protocolId=probedefinition$Protocol$Id, # protocol name
scriptName=infos$Name,inputName=inputs,inputId=infos$Inputs$`$values`[[1]]$Id) # script name
if(length(probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`)>1){ # reading the following inputGroups
for(i in 2:length(probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`)){
inputs <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[i]]$Inputs$`$values`[[1]]$Name
infos <- probedefinition$ScriptRunners$`$values`$Script$InputGroups$`$values`[[i]]
PROTOCOL <- rbind(PROTOCOL,data.frame(protocolName=probedefinition$Protocol$Name,
protocolId=probedefinition$Protocol$Id,
scriptName=infos$Name,inputName=inputs,
inputId=infos$Inputs$`$values`[[1]]$Id)) }}
return(PROTOCOL) }
paths = list.files(probe.definition,recursive=TRUE,full.names=TRUE,include.dirs=FALSE) # listing files in probe.definition path
PROTOCOL <- readProbe(paths[1]) # reading the first Probe Definition file
if(length(list.files(probe.definition))>1){ # reading the following Probe Definition files when > 1
for(path in paths[2:length(paths)]){ PROTOCOL <- rbind(PROTOCOL,readProbe(path)) }}
for(i in 1:nrow(data)){
for(j in 1:nrow(PROTOCOL)){ if(!is.na(data[i,3]) & data[i,3]==PROTOCOL[j,5]){ data[i,3] <- as.character(PROTOCOL[j,4]) }}}}
# 3. Cleaning and unlisting Response data
# ...............................................................
if(messages==TRUE){ cat("\n\nUnlisting Response data and removing system information...")}
data$Response <- gsub("list","",data$Response) # cleaning categorical items from Sensus system info
data$Response <- gsub(paste("c","\\(|\\)",sep=""),"",data$Response)
data$Response <- gsub("\\(|\\)","",data$Response)
data$Response <- gsub("\\[|\\]","",data$Response)
data$Response <- gsub("\\$type` = \"System.Collections.Generic.List`1System.Object, mscorlib, mscorlib\", ","",data$Response)
data$Response <- gsub("\\$values","",data$Response)
data$Response <- gsub('``` = ', "",data$Response)
data$Response <- gsub('\ ', "",data$Response)
data$Response <- gsub('\"', "",data$Response)
data$Response <- gsub('\"No\"', "No",data$Response)
data$Response <- gsub('\"Sì\"', "Si",data$Response)
if(class(data$Response)=="data.frame"){ data$Response <- as.character(data$Response$`$values`[[1]]) # unlisting Response column
} else { data$Response <- as.character(data$Response) }
# 4. Encoding time information
# ...............................................................
if(messages==TRUE){ cat("\n\nConverting Timestamp variables as POSIXct...")}
timestamps <- c("Timestamp","RunTimestamp","SubmissionTimestamp") # timestamps variables
data[,timestamps] <- lapply(data[,timestamps],function(x) as.POSIXct(x,format="%Y-%m-%dT%H:%M:%OS",tz="GMT"))
# 5. Cleaning incomplete double responses
# ...............................................................
# in some cases, multiple responses are recorded with the same ParticipantId, RunTimeStamp, and InputId
if(messages==TRUE){ cat("\n\nLooking for double responses...\n")}
data$id.r <- paste(data$ParticipantId,data$RunTimestamp,data$InputId,sep="_") # ID-time-item identifier
data$id.t <- paste(data$ParticipantId,data$RunTimestamp,sep="_") # ID-time identifier
doubles <- levels(as.factor(data[duplicated(data$id.r),"id.t"])) # double responses (same ID-time-item identifier)
ndoubles <- rep(length(doubles),2) # No. of double responses
for(double in doubles){ subId <- levels(as.factor(data[data$id.t==double,"SubmissionTimestamp"])) # submission time
if(length(subId)>1){ ndoubles[1] <- ndoubles[1] - 1
maxDouble <- max(as.POSIXct(subId)) # when the SubmissionTimestamp is different -> splitting responses
data[data$id.t==double,"RunTimestamp"] <- maxDouble - 61.20 # 2nd RunTimeStamp = Submission - 61.20 sec (median resp time)
} else { ndoubles[2] <- ndoubles[2] - 1
data <- rbind(data[data$id.t!=double,], # when even the SubmissionTimestamp is different -> removing double resp.
data[data$id.t==double & !duplicated(data$id.r),]) }}
cat(" - Removed",ndoubles[1],"double responses (same ID, InputName, RunTimestamp, & SubmissionTimestamp)\n",
"- Splitted",ndoubles[2],"double responses with different SubmissionTimestamp\n",
" (in these cases, RunTimestamp is recomputed as SubmissionTimestamp - 61.20 sec (average response length)")
# 6. Sorting columns and Reshaping
# ...............................................................
if(messages==TRUE){ cat("\n\nSorting columns and reshaping (one row per data entry)...")}
colnames(data)[1] <- "ID" # selecting and sorting columns
data <- data[,c("ID","os","ProtocolId","ScriptName","RunTimestamp","SubmissionTimestamp","InputId","Response","Timestamp")]
data <- data[order(data$ID,data$Timestamp),] # sorting rows
wide <- reshape(data[,1:ncol(data)-1],v.names=c("Response"),timevar=c("InputId"), # reshaping from long to wide (resp-by-resp)
idvar=c("RunTimestamp","SubmissionTimestamp"),direction=c("wide"),sep="")
colnames(wide) <- gsub("Response","",gsub("\\-",".",colnames(wide))) # removing label "Response" from ResponseId
wide <- wide[order(wide$ID,wide$RunTimestamp),] # sorting rows by participant ID and RunTimestamp
# wide <- data.frame(cbind(wide[1:3],wide[ncol(data)],wide[5:ncol(data)-1])) # final sorting of rows and columns
wide <- wide[order(wide$ID,wide$RunTimestamp),]
row.names(wide) <- as.character(1:nrow(wide))
if(messages==TRUE){ cat("\n\nRead",nrow(wide),"responses from",nlevels(as.factor(data$ID)),"participants.") }
# 7. Processing response times
# ...............................................................
data$RT <- as.numeric(NA) # computing item response times (i.e., seconds between each response and the previous one)
for(i in 2:nrow(data)){ if(data[i,"ID"]==data[i-1,"ID"] & data[i,"RunTimestamp"]==data[i-1,"RunTimestamp"]){
data[i,"RT"] <- difftime(data[i,"Timestamp"],data[i-1,"Timestamp"],units="secs")}}
data$RT.tot <- as.numeric(NA) # computing survey total response time (i.e., last - first response of each survey)
data$survey <- as.factor(paste(data$ID,data$RunTimestamp,sep="_"))
for(survey in levels(data$survey)){ data[data$survey==survey,"RT.tot"] <-
difftime(data[data$survey==survey,"SubmissionTimestamp"][1],
data[data$survey==survey,"Timestamp"][1],units="secs")}
rt <- reshape(data[,c("ID","RunTimestamp","SubmissionTimestamp","RT.tot","InputId","RT")], # reshaping
v.names=c("RT"),timevar=c("InputId"),idvar=c("RunTimestamp","SubmissionTimestamp"),direction=c("wide"),sep="")
colnames(rt)[5:ncol(rt)] <- gsub("RT","",colnames(rt)[5:ncol(rt)]) # removing label "Timestamp" from ResponseId
# 7. Returning processed dataset and response time
# ...............................................................
return(list(wide,rt))}
Here, we use the readSurveyData function (depicted
above) to read and preliminary recode the .JSON data files, and to merge
them into the ESM dataset. In one single case, the JSON
string contains (illegal) UTF8 byte-order-mark.
# data reading, encoding and saving
ESM <- readSurveyData(data.path="ESM/data",probe.definition="ESM/probe")
## Loading required package: jsonlite
##
##
## Reading 3200 data files from ESM/data ...
## Warning: JSON string contains (illegal) UTF8 byte-order-mark!
##
##
## Converting ResponseId into InputId based on Probe Definition...
##
## Unlisting Response data and removing system information...
##
## Converting Timestamp variables as POSIXct...
##
## Looking for double responses...
## - Removed 2 double responses (same ID, InputName, RunTimestamp, & SubmissionTimestamp)
## - Splitted 10 double responses with different SubmissionTimestamp
## (in these cases, RunTimestamp is recomputed as SubmissionTimestamp - 61.20 sec (average response length)
##
## Sorting columns and reshaping (one row per data entry)...
##
## Read 1392 responses from 104 participants.
ESM.RT <- ESM[[2]]
ESM <- ESM[[1]]
We also integrate the ESM dataset with 4
additional responses that were sent via instant messages (i.e.,
screenshot of the responses) due to technical problems. These were
stored in the screenshotX4.rda file.
load("ESM/screenshot_surevyMalfunctioning/screenshotX4.RData") # loading additiona responses
cat("Adding",nrow(screenshotX4),"additional responses sent via instant messages") # N = 4
## Adding 4 additional responses sent via instant messages
ESM <- rbind(ESM,screenshotX4) # adding additional responses to the ESM dataset
Finally, we import the “baseline surveys” data
(i.e., ESM data collected in the lab at the beginning of each day, using
Google Forms),
exported from Google Form as a .CSV file. Some recoding procedures will
be necessary before merging the baseline and the
ESM datasets.
# reading raw dataset exported from Google Forms
baseline <- read.csv("baseline/Baseline.csv",header=TRUE)
cat("Reading",nrow(baseline),"responses to the baseline questionnaire from",
nlevels(as.factor(baseline$Nome.utente)),"participants")
## Reading 311 responses to the baseline questionnaire from 108 participants
Second, we recode the ESM and baseline
variables to be used for the analyses, and we merge the two
datasets.
The following packages and functions are used to optimize the ESM data recoding:
library(lubridate); library(plyr); library(scales)
within.day.adjust
within.day.adjust <- function(data){
# Creating within.day.indicator (dafault = 0)
data$within.day <- NA
# adjusting within.day based on the scheduled times
for(i in 1:nrow(data)){
if(strftime(data[i,"RunTimestamp",], # survey 1 between 11:20 (- 10 min error) and 11:40 (up to 12.00 + 10min error)
format="%H:%M:%S")>strftime("1970-01-01 11:10:00",
format="%H:%M:%S") & strftime(data[i,"RunTimestamp",],
format="%H:%M:%S")<strftime("1970-01-01 12:10:00",
format="%H:%M:%S")){
data[i,"within.day"] = 1 }
else if(strftime(data[i,"RunTimestamp",], # survey 2 between 12:30 (- 10min error) and 12:50 (up to 13:10 + 10min error)
format="%H:%M:%S")>strftime("1970-01-01 12:20:00",
format="%H:%M:%S") & strftime(data[i,"RunTimestamp",],
format="%H:%M:%S")<strftime("1970-01-01 13:20:00",
format="%H:%M:%S")){
data[i,"within.day"] = 2 }
else if(strftime(data[i,"RunTimestamp",], # survey 3 between 13:40 (- 10min error) and 14:00 (up to 14:20 + 10min error)
format="%H:%M:%S")>strftime("1970-01-01 13:30:00",
format="%H:%M:%S") & strftime(data[i,"RunTimestamp",],
format="%H:%M:%S")<strftime("1970-01-01 14:30:00",
format="%H:%M:%S")){
data[i,"within.day"] = 3}
else if(strftime(data[i,"RunTimestamp",], # survey 4 between 14:50 (- 10min error) and 15:10 (up to 15.30 + 10min error)
format="%H:%M:%S")>strftime("1970-01-01 14:40:00",
format="%H:%M:%S") & strftime(data[i,"RunTimestamp",],
format="%H:%M:%S")<strftime("1970-01-01 15:40:00",
format="%H:%M:%S")){
data[i,"within.day"] = 4 }
else if(strftime(data[i,"RunTimestamp",], # survey 5 between 16:00 (- 10min error) and 16:20 (up to 16:40 + 10min error)
format="%H:%M:%S")>strftime("1970-01-01 15:50:00",
format="%H:%M:%S") & strftime(data[i,"RunTimestamp",],
format="%H:%M:%S")<strftime("1970-01-01 16:50:00",
format="%H:%M:%S")){
data[i,"within.day"] = 5 }}
# sorting dataset based on ID, day.of.week and within.day
data <- data[order(data$ID,data$day.of.week,data$within.day),]
data$within.day <- as.factor(data$within.day)
rownames(data) <- 1:nrow(data)
cat("Creating/Adjusting within.day variable...\n Printing summary of cases:\n")
print(summary(data$within.day))
return(data) }
Creates or adjusts the within.day variable based on the
scheduled temporal windows used in the current study.
varsRecoding
varsRecoding <- function(data){
# ID and os as factors
data[,c("ID","os")] <- lapply(data[,c("ID","os")],as.factor)
# Recoding item scores and reversed items
ratings <- colnames(data)[c(which(colnames(data)=="t3.nervoso.tranquillo"), # recoding ratings as integer
which(colnames(data)=="e3.affaticato.fresco"):which(colnames(data)=="intensity.NE"),
which(colnames(data)=="v1.male.bene"):which(colnames(data)=="v3.positivo.negativo"))]
data[,ratings] <- lapply(data[,ratings],as.integer)
data$v3.positivo.negativo <- 8 - data$v3.positivo.negativo # HEDONIC TONE = positive
colnames(data)[which(colnames(data)=="v2.soddisfatto.insoddisfatto")] <- "v2.insoddisfatto.soddisfatto" # changing wrong label
data$t1.rilassato.teso <- 8 - data$t1.rilassato.teso # CALMNESS = positive
data$e2.pieno.privodenergia <- 8 - data$e2.pieno.privodenergia # ENERGETIC AROUSAL = positive
# Recoding confounders
data$alcohol <- as.factor(gsub("No","0",gsub("Sì","1",data$alcohol))) # alcohol (from yes/no --> 1/0)
data$Activity <- as.factor(gsub("Leggerasedutiocoricatiperlamaggiorpartedeltempo","0", # physical activity level (0 = seated)
gsub("Moderatacamminata,farelescale","1", # (1 = moderate)
gsub("Intensaattivitàsportiva,palestra,corsaodiverserampediscale","2", # (2 = vigorous)
data$Activity))))
data$People.rec <- as.factor(gsub("No,erodasola","0",gsub("No,erodasolo","0", # (0=alone)
gsub("Sì,manonsonostatadisturbata","1",gsub("Sì,manonsonostatadisturbato","1", # (1=not interact)
gsub("Sì,esonostataunpo'disturbata","2",gsub("Sì,esonostataunpo'disturbato","2", # (2=disturb)
gsub("Sì,esonostatainterrotta","3",gsub("Sì,esonostatainterrotto","3", #(3 = interrupted)
data$People.rec)))))))))
data$ProtocolId <- as.factor(gsub("ProtocolStudent","",data$ProtocolId)) # from protocol ID to gender
colnames(data)[which(colnames(data)=="ProtocolId")] <- "gender"
data[,c("smoke","coffe")] <- lapply(data[,c("smoke","coffe")],as.factor) # remaining confounders as factor 0/1
# Sorting and renaming columns
data <- data[,c("ID","os","gender","day.of.week","within.day","RunTimestamp", # response information
"v1.male.bene","v2.insoddisfatto.soddisfatto","v3.positivo.negativo", # hedonic tone
"t1.rilassato.teso","t2.agitato.calmo","t3.nervoso.tranquillo", # calmness
"e1.stanco.sveglio","e2.pieno.privodenergia","e3.affaticato.fresco", # energetic arousal
"PE","intensity.PE","NE","intensity.PE", # emotional events
"smoke","coffe","alcohol","Activity","People.rec")] # confounders
colnames(data)[7:ncol(data)] <- c("v1","v2","v3","t1","t2","t3","f1","f2","f3", # renaming columns
"PE","PE.int","NE","NE.int","smoke","coffe","alcohol","activity","people")
# final sorting and returning data
data$sort <- as.numeric(substr(data$ID,4,6))
data <- data[order(data$sort,data$RunTimestamp),] # sorting by ID and timestamp
data$sort <- NULL # removing sort column
return(data) }
Recodes slider and multiple-choice responses collected with the Sensus protocols used in the study, sort the columns and fix problems due to Daylight Time changes.
bslRecoding
bslRecoding <- function(data){
# renaming and sorting columns
colnames(data) <- c("RunTimestamp","ID","v1","t1","f1","v2","t2","f2","v3","t3","f3",
"PE","PE.int","NE","NE.int","Sleep1","Sleep2","Sleep3","daybefore.ex","drugs","drugs.which","day.of.week","within.day")
data <- data[,c("ID","day.of.week","within.day","RunTimestamp",
"v1","v2","v3","t1","t2","t3","f1","f2","f3", # renaming columns
"PE","PE.int","NE","NE.int","Sleep1","Sleep2","Sleep3","daybefore.ex","drugs","drugs.which")]
# ID as factor
data$ID <- as.factor(data$ID)
# recoding item scores
data$v2 <- 8 - data$v2 # hedonic tone = positive
data$v3 <- 8 - data$v3
data$t2 <- 8 - data$t2 # t.arousal = negative
data$t3 <- 8 - data$t3
data$f2 <- 8 - data$f2 # e.arousal = positive
data$Sleep2 <- 8 - data$Sleep2 # sleep quality = positive
data$Sleep3 <- 8 - data$Sleep3
# other confounding variables
data$drugs <- as.factor(gsub("Sì","1",gsub("No","0",data$drugs))) # drugs
data$daybefore.ex <- as.factor(gsub("Sì","1",gsub("No","0",data$daybefore.ex))) # intense exercise on the day before
# sorting by timestamp
data <- data[order(data$ID,data$RunTimestamp),]
return(data)}
First, we recode participants’ identification codes
ID (currently corresponding to their e-mail addresses)
using the “HRVXXX” format (e.g., from ‘john.smith@gmail.com’ to ‘HRV001’). This is done with
the participantID_recoding() and the
participantID_recoding_baseline() functions. Again, since
the participants’ e-mail addresses are confidential, both the function
and the original dataset are not showed.
# loading and using the function to recode the ID variable
source("participantID_recoding.R"); source("participantID_recoding_baseline.R")
(ESM <- participantID_recoding(ESM))[1:3,] # recoding ESM IDs and showing first three rows
colnames(baseline)[2] <- "ID"
(baseline <- participantID_recoding_baseline(baseline))[1:3,] # recoding baseline IDs and showing first three rows
Second, we process the timestamps (i.e., temporal
variables) in both the EMA and the baseline
datasets.
Here, we convert the included timestamps from character to POSIXct (the R format for dates and times), using the “GMT” time zone (note that data were collected in the CET/CEST region, but GMT is easier to handle).
# converting temporal variables as POSIXct in the ESM dataset
ESM[,c("RunTimestamp","SubmissionTimestamp")] <- # no need to specify format since they already are POSIXct
lapply(ESM[,c("RunTimestamp","SubmissionTimestamp")],function(x) as.POSIXct(x,tz="GMT"))
# converting baseline variables as POSIXct in the baseline dataset
colnames(baseline)[which(colnames(baseline)=="Informazioni.cronologiche")] <- "RunTimestamp"
baseline$RunTimestamp <- as.POSIXct(baseline$RunTimestamp,format="%Y/%m/%d %I:%M:%S %p",tz="GMT")
Then, we inspect the data for cases of wrongly encoded
timestamps, considering that data collection started on
November 6th, 2018 and ended on November 21th, 2019. Only two cases are
detected, whose RunTimestamp was recorded with the wrong
year.
# min and max RunTimestamp
c(min(ESM$RunTimestamp),max(ESM$RunTimestamp)) # ESM: unexpected values < November 2018
## [1] "2018-01-24 16:40:00.000 GMT" "2019-11-21 15:11:17.835 GMT"
c(min(baseline$RunTimestamp),max(baseline$RunTimestamp)) # baseline: ok
## [1] "2018-11-06 10:12:16 GMT" "2019-11-21 10:45:48 GMT"
# only two cases with RunTimestamp < November 2018 and no SubmissionTimestamp
ESM[ESM$RunTimestamp < as.POSIXct("2018-11-06") | ESM$RunTimestamp > as.POSIXct("2019-11-22"),
c("ID","RunTimestamp","SubmissionTimestamp")]
# plotting all timestamps from these two participants: isolated point recorded as it was collected one year before
grid.arrange(ggplot(ESM[ESM$ID=="HRV046",],aes(RunTimestamp)) + geom_histogram(bins=30) + ggtitle("HRV046"),
ggplot(ESM[ESM$ID=="HRV048",],aes(RunTimestamp)) + geom_histogram(bins=30) + ggtitle("HRV048"))
# correcting wrongly encoded RunTimestamps (adding 1year) and SubmissionTimestamps (RunTimestamp + 61.20 sec)
ESM[substr(ESM$RunTimestamp,1,7)=="2018-01","SubmissionTimestamp"] <-
ESM[substr(ESM$RunTimestamp,1,7)=="2018-01","RunTimestamp"] + 61.20
ESM[substr(ESM$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] <-
ESM[substr(ESM$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] + 24*60*60*365
c(min(ESM$RunTimestamp),max(ESM$RunTimestamp)) # sanity check: min and max matching with the data collection period
## [1] "2018-11-06 10:18:46.547 GMT" "2019-11-21 15:11:17.835 GMT"
# 2 further cases with missing SubmissionTimestamp (replacing with RunTimestamp + 61.20 sec)
ESM[is.na(ESM$SubmissionTimestamp),c("ID","RunTimestamp","SubmissionTimestamp")]
ESM[is.na(ESM$SubmissionTimestamp),"SubmissionTimestamp"] <- ESM[is.na(ESM$SubmissionTimestamp),"RunTimestamp"] + 61.20
Second, we adjust the temporal coordinates
accounting for daylight saving time. Indeed, whereas
timestamps were recorded based on the internal clock of participants’
mobile phones, which automatically synchronizes based on time changes,
such time shifts got lost in the data exporting, resulting in
one-hour shifts during standard time. In Italy, during
the data collection period, daylight saving time was applied from March
31th to October 27th 2019. Here, we visually inspect the timestamp of
the first response for both the EMA and the
baseline datasets.
We can note that in the EMA dataset, the responses
recorded during standard time were encoded with
one hour less than the actual time (i.e., the first
questionnaire was scheduled at 11:30, whereas most responses were
recorded at 10:30), whereas those recorded during daylight
saving time show one additional hour less than
the actual time (i.e., most responses recorded at 9:30). In the
baseline dataset, the responses recorded during
daylight saving time were recorded with one
hour less than the actual time (i.e., baseline questionnaires
were administered around 10:30, whereas most responses are around 9:30),
whereas those recorded during standard time look
fine.
# ESM: computing and plotting time without date
ESM$daytime <- as.POSIXct(paste(hour(ESM$RunTimestamp),minute(ESM$RunTimestamp)),format="%H %M",tz="GMT")
ESM$daylight <- "standard"
ESM[ESM$RunTimestamp > as.POSIXct("2019-04-01") & ESM$RunTimestamp < as.POSIXct("2019-10-27"),"daylight"] <- "daylight"
ggplot(ESM[!duplicated(ESM$ID),],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
ggtitle("ESM$RunTimestamp of the first response for each participant")
# baseline: computing and plotting time without date
baseline$daytime <- as.POSIXct(paste(hour(baseline$RunTimestamp),minute(baseline$RunTimestamp)),format="%H %M",tz="GMT")
baseline$daylight <- "standard"
baseline[baseline$RunTimestamp>as.POSIXct("2019-04-01") & baseline$RunTimestamp<as.POSIXct("2019-10-27"),"daylight"]<-"daylight"
ggplot(baseline[,],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
ggtitle("baseline$RunTimestamp")
Here, we adjust the temporal coordinates in both datasets according to the observed time shifts.
# adding 1 hour to all ESM timestamps
ESM[,c("RunTimestamp","SubmissionTimestamp")] <- ESM[,c("RunTimestamp","SubmissionTimestamp")] + 1*60*60
# adding 1 further hour to ESM timestamps during daylightsaving time
ESM[ESM$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] <-
ESM[ESM$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] + 1*60*60
# adding 1 hour to baseline timestamps during daylightsaving time
baseline[baseline$daylight=="daylight","RunTimestamp"] <- baseline[baseline$daylight=="daylight","RunTimestamp"] + 1*60*60
# ESM: computing and plotting time without date (NOW OK)
ESM$daytime <- as.POSIXct(paste(hour(ESM$RunTimestamp),minute(ESM$RunTimestamp)),format="%H %M",tz="GMT")
ggplot(ESM[!duplicated(ESM$ID),],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
ggtitle("ESM$RunTimestamp of the first response for each participant")
# baseline: computing and plotting time without date (NOW OK)
baseline$daytime <- as.POSIXct(paste(hour(baseline$RunTimestamp),minute(baseline$RunTimestamp)),format="%H %M",tz="GMT")
ggplot(baseline[,],aes(y=daytime,fill=daylight)) + geom_bar(position=position_dodge(),width=50) +
scale_y_datetime(labels = date_format("%H:%M"),date_breaks="60 min") +
ggtitle("baseline$RunTimestamp of the first response for each participant")
# removing daytime and daylight columns
ESM$daytime <- ESM$daylight <- baseline$daytime <- baseline$daylight <- NULL
Finally, we create two variables to better organize the temporal structure of the datasets:
day.of.week indicating the weekday indicator (i.e.,
1 = Tuesday, 2 = Wednesday, 3 = Thursday)
within.day consisting in a row identifier within
each day. For baseline responses, we set
within.day = 0
# creating indicator for the weekday
ESM$day.of.week <- as.POSIXlt(ESM$RunTimestamp)$wday - 1
baseline$day.of.week <- as.POSIXlt(baseline$RunTimestamp)$wday - 1
levels(as.factor(c(levels(as.factor(ESM$day.of.week)),levels(as.factor(baseline$day.of.week))))) # sanity check: ok
## [1] "1" "2" "3"
# creating row identifier within each day
ESM <- ddply(ESM,c("ID","day.of.week"),transform,within.day=seq_along(day.of.week))
baseline$within.day <- 0 # assigning zero to the baseline questionnaires
# showing data structure
ESM[,c("ID","day.of.week","within.day","RunTimestamp")]
However, the method used above to encode the within.day
variable counts the survey as they are received (i.e., first, second,
third), and not as they were scheduled (i.e., based on the
RunTimestamp variable), neglecting missing responses.
Here, we use the within.day.adjust() function showed at
the beginning of section 2.2 to recode the within.day
variable accounting for the following scheduled temporal
windows and the expiration time (i.e., after
20 minutes), and we add 10 extra minutes before and
after the following time limits to account for the variability among
devices:
0 = 09:45 ~ 11:00 (baseline survey)
1 = 11:20 - 11:40 + 20 min (up to 12:00)
2 = 12:30 - 12:50 + 20 min (up to 13:10)
3 = 13:40 - 14:00 + 20 min (up to 14:20)
4 = 14:50 - 15:10 + 20 min (up to 15:50)
5 = 16:00 - 16:20 + 20 min (up to 16:40)
Here, we apply the function and we correct one case of missing
within.day (i.e., RunTimestamp = 17:40;
recorded as within.day = 5).
# adjusting within.day based on scheduled time windows
ESM <- within.day.adjust(ESM)
## Creating/Adjusting within.day variable...
## Printing summary of cases:
## 1 2 3 4 5 NA's
## 275 280 288 267 285 1
# correcting 1 missing case (17:40 -> within.day = 5)
ESM[is.na(ESM$within.day),"within.day"] <- 5
# showing data structure
ESM[,c("ID","day.of.week","within.day","RunTimestamp")]
Finally, we use the varsRecoding() and the
bslRecoding() functions showed at the beginning of section
2.2 to recode the remaining variables (i.e., renaming and reversing item
scores, recoding categorical variables, sorting and renaming columns) in
the ESM and the baseline datasets,
respectively.
# recoding EMA dataset
ESM <- varsRecoding(ESM)
# recoding baseline dataset
baseline <- bslRecoding(baseline)
Here, we conduct a series of sanity checks by inspecting the summary
and plots of each variable. In the ESM dataset, we can see
that 11 cases from the same participant HRV099 have both
coffe and alcohol = "NULL", whereas the values
of the other confounders look fine. Since this is likely due to
unanswered items because the condition was not met, we recode those
cases with coffe and alcohol = 0. Moreover, we
can note that in some cases participants selected more than one option
for the people variable. In these cases, we recode the
score in a conservatory way, that is by assigning the higher score.
# inspecting summary of ESM variables
summary(ESM)
## ID os gender day.of.week within.day
## HRV083 : 19 Android:1110 F:702 Min. :1.00 1:275
## HRV032 : 17 iOS : 286 M:694 1st Qu.:1.00 2:280
## HRV036 : 16 Median :2.00 3:288
## HRV073 : 16 Mean :1.99 4:267
## HRV095 : 16 3rd Qu.:3.00 5:286
## HRV003 : 15 Max. :3.00
## (Other):1297
## RunTimestamp v1 v2
## Min. :2018-11-06 11:18:46.54 Min. :1.000 Min. :1.000
## 1st Qu.:2018-12-18 13:54:35.94 1st Qu.:4.000 1st Qu.:4.000
## Median :2019-03-13 16:15:10.50 Median :5.000 Median :5.000
## Mean :2019-03-22 16:02:26.94 Mean :4.981 Mean :4.569
## 3rd Qu.:2019-05-21 15:05:04.84 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :2019-11-21 16:11:17.83 Max. :7.000 Max. :7.000
## NA's :13
## v3 t1 t2 t3
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :5.000 Median :5.000 Median :5.000
## Mean :4.734 Mean :4.787 Mean :4.827 Mean :4.853
## 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:6.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :14 NA's :8 NA's :13 NA's :23
## f1 f2 f3 PE
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.000
## Median :4.000 Median :4.000 Median :4.000 Median :1.000
## Mean :4.255 Mean :4.239 Mean :4.167 Mean :1.856
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
## NA's :10 NA's :14 NA's :23 NA's :24
## PE.int NE NE.int smoke coffe
## Min. :1.000 Min. :1.000 Min. :1.000 0 :1167 0 :1101
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1 : 157 1 : 221
## Median :1.000 Median :1.000 Median :1.000 2 : 10 2 : 8
## Mean :1.798 Mean :1.393 Mean :1.798 5 : 1 NULL: 11
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:2.000 NULL: 11 NA's: 55
## Max. :7.000 Max. :7.000 Max. :7.000 NA's: 50
## NA's :27 NA's :36 NA's :27
## alcohol activity people
## 0 :1329 0 :921 1 :698
## 1 : 9 1 :388 0 :345
## NA's: 58 2 : 29 2 :249
## NA's: 58 3 : 35
## 1,0 : 2
## (Other): 7
## NA's : 60
# cases with coffe or alcohol = "NULL"
ESM[!is.na(ESM$coffe) & ESM$coffe=="NULL",c(1,20:ncol(ESM))]
ESM[!is.na(ESM$coffe) & ESM$coffe=="NULL",c("smoke","coffe")] <- "0"
ESM[,c("smoke","coffe")] <- lapply(lapply(ESM[,c("smoke","coffe")],as.character),as.factor)
# inspecting summary of people
summary(ESM$people)
## 0 0,1 0,1,2 0,2 1 1,0 2 2,1 2,3 3 3,0 3,1 NA's
## 345 1 1 1 698 2 249 1 1 35 1 1 60
ESM$people <- as.factor(gsub("0,1","0", # replacing multi-option responses with the higher selected score
gsub("0,1,2","2",
gsub("0,2","2",
gsub("1,0","1",
gsub("2,1","2",
gsub("2,3","3",
gsub("3,0","3",
gsub("3,1","3",ESM$people)))))))))
# re-inspecting recoded variables (no more "NULL" and multi-choice responses)
summary(ESM[,c("smoke","coffe","people")])
## smoke coffe people
## 0 :1178 0 :1112 0 :346
## 1 : 157 1 : 221 1 :700
## 2 : 10 2 : 8 2 :252
## 5 : 1 NA's: 55 3 : 38
## NA's: 50 NA's: 60
# inspecting summary of baseline variables (all look fine)
summary(baseline)
## ID day.of.week within.day RunTimestamp
## HRV001 : 3 Min. :1.000 Min. :0 Min. :2018-11-06 10:12:16.0
## HRV002 : 3 1st Qu.:1.000 1st Qu.:0 1st Qu.:2018-12-18 10:19:24.5
## HRV003 : 3 Median :2.000 Median :0 Median :2019-03-13 10:27:06.0
## HRV004 : 3 Mean :1.997 Mean :0 Mean :2019-03-22 08:11:48.5
## HRV005 : 3 3rd Qu.:3.000 3rd Qu.:0 3rd Qu.:2019-05-21 10:33:18.5
## HRV006 : 3 Max. :3.000 Max. :0 Max. :2019-11-21 10:45:48.0
## (Other):293
## v1 v2 v3 t1 t2
## Min. :2.00 Min. :2.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.00 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:2.000
## Median :5.00 Median :4.000 Median :5.000 Median :3.000 Median :3.000
## Mean :4.91 Mean :4.466 Mean :4.746 Mean :3.251 Mean :3.093
## 3rd Qu.:6.00 3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :7.00 Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
##
## t3 f1 f2 f3 PE
## Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:3.00 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.000
## Median :3.000 Median :5.00 Median :5.000 Median :4.000 Median :2.000
## Mean :2.987 Mean :4.28 Mean :4.347 Mean :4.212 Mean :2.199
## 3rd Qu.:4.000 3rd Qu.:5.00 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:3.000
## Max. :7.000 Max. :7.00 Max. :7.000 Max. :7.000 Max. :6.000
##
## PE.int NE NE.int Sleep1 Sleep2
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :2.00
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:5.00
## Median :1.000 Median :1.000 Median :1.000 Median :5.000 Median :6.00
## Mean :1.929 Mean :1.707 Mean :1.712 Mean :4.514 Mean :5.91
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:6.000 3rd Qu.:7.00
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :7.000 Max. :7.00
## NA's :2
## Sleep3 daybefore.ex drugs drugs.which
## Min. :1.00 0:242 0:297 Length:311
## 1st Qu.:3.00 1: 69 1: 14 Class :character
## Median :4.00 Mode :character
## Mean :3.91
## 3rd Qu.:5.00
## Max. :7.00
##
From the summary above, we can see that some categorical confounders
have only a very small number of ratings for certain levels (i.e.,
smoke has only 11 cases rated as “2” or higher,
coffe has only 8 cases rated as “2”, activity has only 29
cases rated as “2”, people has only 3 cases rated as “3”). Here, we
simplify such variables by dichotomizing each of them
(i.e., 0/1).
# plotting frequency of cases in each level of each confounder
par(mfrow=c(1,4))
for(conf in c("smoke","coffe","activity","people")){ plot(ESM[,conf],main=conf) }
# smoke: 2 and 5 --> 1 (i.e., "1+ cigarettes")
ESM$smoke <- as.factor(gsub("2","1",gsub("5","1",ESM$smoke)))
# coffe: 2 --> 1 (i.e., "1+ coffes")
ESM$coffe <- as.factor(gsub("2","1",ESM$coffe))
# activity: 2 --> 1 (i.e., "moderate/intense physical activity")
ESM$activity <- as.factor(gsub("2","1",ESM$activity))
# people: 1 --> 0 (i.e., "alone or with some people not disturbing"); 2 and 3 --> 1 (i.e., "someone disturbing or interrupting the recording")
ESM$people <- as.factor(gsub("3","1",gsub("2","1",gsub("1","0",ESM$people))))
# re-plotting
par(mfrow=c(1,4))
for(conf in c("smoke","coffe","activity","people")){ plot(ESM[,conf],main=conf) }
# re-summarizing
summary(ESM[,c("smoke","coffe","activity","people")])
## smoke coffe activity people
## 0 :1178 0 :1112 0 :921 0 :1046
## 1 : 168 1 : 229 1 :417 1 : 290
## NA's: 50 NA's: 55 NA's: 58 NA's: 60
Finally, we can join the baseline dataset with the
ESM dataset.
# merging and fixing the last issues
ESM$within.day <- as.integer(as.character(ESM$within.day)) # within.day as integer for compatibility
new <- rbind.fill(ESM,baseline) # merging datasets
cat("sanity check:",nrow(new)==(nrow(ESM)+nrow(baseline))) # sanity check
## sanity check: TRUE
ESM <- new[order(new$ID,new$day.of.week,new$within.day),] # sorting by ID, day.of.week, and within.day
rownames(ESM) <- 1:nrow(ESM)
# showing merged dataset (everithing looks fine)
ESM
Third, we clean and filter the data based on the inclusion criteria. The following packages and functions are used to optimize data filtering:
library(plyr); library(knitr); library(reshape2); library(ggplot2); library(gridExtra); library(careless)
respRate
respRate <- function(data){
# scheduled No. of surveys (1 baseline + 5 ordinary X 3 days)
tot <- (1 + 5) * 3
bsl <- 1 * 3
ord <- 5 * 3
# computing No. of responses and response rate
data$ID <- as.factor(as.character(data$ID))
rr <- data.frame(ID=levels(data$ID),tot=as.numeric(table(data$ID))) # No. of responses per participant
rr$tot.RR <- round(100*rr$tot/tot) # total response rate
rr$bsl <- as.numeric(table(data[data$within.day==0,"ID"])) # No. of responses to the baseline surveys
rr$bsl.RR <- round(100*rr$bsl/bsl) # baseline response rate
rr$ord <- as.numeric(table(data[data$within.day!=0,"ID"])) # No. of responses to the ordinary surveys
rr$ord.RR <- round(100*rr$ord/ord) # ordinary response rate
# function to print response rate info
rr.message <- function(dat,what){
cat("\n\n",what,"participants: N =",nrow(dat),
"\n -",sum(dat$tot),"responses, from",min(dat$tot),"to",max(dat$tot),
", mean =",round(mean(dat$tot),2),"sd =",round(sd(dat$tot),2),
", mean perc =",round(mean(dat$tot.RR),2),"% sd =",round(sd(dat$tot.RR),2),"%",
"\n -",sum(dat$bsl),"baseline, from",min(dat$bsl),"to",max(dat$bsl),
", mean =",round(mean(dat$bsl),2),"sd =",round(sd(dat$bsl),2),
", mean perc =",round(mean(dat$bsl.RR),2),"% sd =",round(sd(dat$bsl.RR),2),"%",
"\n -",sum(dat$ord),"ordinary, from",min(dat$ord),"to",max(dat$ord),
", mean =",round(mean(dat$ord),2),"sd =",round(sd(dat$ord),2),
", mean perc =",round(mean(dat$ord.RR),2),"% sd =",round(sd(dat$ord.RR),2),"%") }
rr.message(rr[substr(rr$ID,1,3)=="HRV",],what="Included") # included participants
rr.message(rr[substr(rr$ID,1,3)!="HRV",],what="Excluded") # excluded participants
return(rr)} # returning response rate dataset
Computes a data frame of response rate values and prints summary information on response rate.
plotResp
plotResp <- function(data,id,items){
long <- melt(data[,c("id",items)],id) # one row per item
long <- long[order(long$id,long$variable),]
ggplot(long,aes(x=value,y=variable)) + geom_vline(xintercept=4,lwd=0.5,lty=2) + geom_point() + xlim(1,7) +
facet_wrap(id) + theme(text=element_text(size=8)) }
Plots the pattern of responses to mood items for a subsample of responses.
We start by inspecting the original number of respondents and
observations in the ESM dataset. We can see that the
questionnaires were completed by a total of 104
respondents, one less than the 105 participants that were
actually invited to participate to the daily protocol. This is because
one participant HRV009 because she dropped out without
activating the app.
cat("Original No. of responses to the ESM/baseline surveys =",nrow(ESM),"from",nlevels(ESM$ID),"participants")
## Original No. of responses to the ESM/baseline surveys = 1707 from 104 participants
First, we inspect and remove cases of double
responses with the same ID and
RunTimestamp values (N = 10). In these cases, only the
first response is included with the exception of HRV007 on
day 1 and HRV073 on day 2, because of missing values in the
first but not in the second response.
# detecting double responses
ESM$idday <- paste(ESM$ID,ESM$RunTimestamp,sep="_")
cat(nrow(ESM[duplicated(ESM$idday),]),"cases of double responses (same ID and RunTimestamp)")
## 10 cases of double responses (same ID and RunTimestamp)
kable(ESM[ESM$idday%in%ESM[duplicated(ESM$idday),"idday"],
c("ID","day.of.week","within.day","RunTimestamp","v1","v2","v3","t1","t2","t3","NE.int","smoke","coffe","alcohol","activity")])
| ID | day.of.week | within.day | RunTimestamp | v1 | v2 | v3 | t1 | t2 | t3 | NE.int | smoke | coffe | alcohol | activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 108 | HRV007 | 1 | 5 | 2018-11-13 16:21:31.569 | 4 | 4 | 4 | 5 | 5 | 5 | 1 | 0 | 0 | 0 | 0 |
| 109 | HRV007 | 1 | 5 | 2018-11-13 16:21:31.569 | 3 | 3 | 3 | 4 | 3 | 3 | NA | NA | NA | NA | NA |
| 121 | HRV007 | 3 | 5 | 2018-11-15 16:21:49.293 | 3 | 3 | 3 | 3 | 4 | 3 | 1 | 0 | 0 | 0 | 0 |
| 122 | HRV007 | 3 | 5 | 2018-11-15 16:21:49.293 | 3 | 2 | 3 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 0 |
| 492 | HRV032 | 1 | 2 | 2018-12-18 12:54:50.792 | 5 | 5 | 6 | 5 | 5 | 4 | 3 | 1 | 0 | 0 | 1 |
| 493 | HRV032 | 1 | 2 | 2018-12-18 12:54:50.792 | 1 | 1 | 7 | 7 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 566 | HRV036 | 2 | 1 | 2019-01-09 11:33:48.782 | 7 | 5 | 7 | 7 | 7 | 7 | 1 | 0 | 0 | 0 | 0 |
| 567 | HRV036 | 2 | 1 | 2019-01-09 11:33:48.782 | 7 | 5 | 7 | 7 | 7 | 7 | 1 | 0 | 0 | 0 | 0 |
| 1169 | HRV073 | 2 | 3 | 2019-05-08 13:45:50.606 | 5 | 5 | 5 | 4 | 4 | 5 | 2 | 0 | 0 | NA | NA |
| 1170 | HRV073 | 2 | 3 | 2019-05-08 13:45:50.606 | 5 | 4 | 5 | 3 | 4 | 4 | 1 | 0 | 1 | 0 | 0 |
| 1252 | HRV078 | 2 | 2 | 2019-05-15 12:40:54.675 | 3 | 3 | 3 | 3 | 4 | 3 | 1 | 0 | 1 | 0 | 1 |
| 1253 | HRV078 | 2 | 2 | 2019-05-15 12:40:54.675 | 4 | 3 | 3 | 3 | 3 | 3 | 1 | 0 | 0 | 0 | 1 |
| 1329 | HRV083 | 1 | 2 | 2019-05-21 12:49:08.996 | 6 | 5 | 4 | 6 | 6 | 6 | 1 | 1 | 0 | 0 | 0 |
| 1330 | HRV083 | 1 | 2 | 2019-05-21 12:49:08.996 | 6 | 4 | 4 | 6 | 6 | 6 | 1 | 1 | 0 | 0 | 0 |
| 1337 | HRV083 | 2 | 3 | 2019-05-22 14:10:05.861 | 4 | 4 | 4 | 5 | 5 | 5 | 1 | 1 | 1 | 0 | 0 |
| 1338 | HRV083 | 2 | 3 | 2019-05-22 14:10:05.861 | 4 | 4 | 4 | 3 | 5 | 5 | 1 | 1 | 1 | 0 | 0 |
| 1342 | HRV083 | 3 | 1 | 2019-05-23 11:41:29.187 | 4 | 4 | 4 | 5 | 5 | 6 | 1 | 0 | 0 | 0 | 0 |
| 1343 | HRV083 | 3 | 1 | 2019-05-23 11:41:29.187 | 4 | 4 | 4 | 6 | 6 | 6 | 1 | 0 | 0 | 0 | 0 |
| 1347 | HRV083 | 3 | 5 | 2019-05-23 16:04:47.322 | 4 | 4 | 4 | 5 | 5 | 6 | 1 | 1 | 0 | 0 | 0 |
| 1348 | HRV083 | 3 | 5 | 2019-05-23 16:04:47.322 | 4 | 4 | 4 | 5 | 5 | 6 | 1 | 1 | 0 | 0 | 0 |
# excluding second response for HRV007 (day 1) and HRV073 (day 2)
memory <- ESM
ESM <- ESM[!(ESM$idday=="HRV007_2018-11-13 16:18:32.769" & is.na(ESM$activity)),]
ESM <- ESM[!(ESM$idday=="HRV073_2019-05-08 13:42:51.806" & is.na(ESM$activity)),]
# excluding all remaining double responses
ESM <- ESM[!duplicated(ESM$idday),]
# removing double responses
cat("Removed",nrow(memory)-nrow(ESM),"double responses")
## Removed 10 double responses
Then, we use the same procedure to inspect and remove cases of
double responses with the same ID,
day.of.week and within.day values. We can see
10 further cases of double responses. Since no missing data are included
in any of these cases, we can just filter the latest response of each
couple.
# detecting double responses
ESM$idday <- paste(ESM$ID,ESM$day.of.week,ESM$within.day,sep="_")
cat(nrow(ESM[duplicated(ESM$idday),]),"cases of double responses (same ID and RunTimestamp)")
## 10 cases of double responses (same ID and RunTimestamp)
kable(ESM[ESM$idday%in%ESM[duplicated(ESM$idday),"idday"],
c("ID","day.of.week","within.day","RunTimestamp","v1","v2","v3","t1","t2","t3","NE.int","smoke","coffe","alcohol","activity")])
| ID | day.of.week | within.day | RunTimestamp | v1 | v2 | v3 | t1 | t2 | t3 | NE.int | smoke | coffe | alcohol | activity | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 491 | HRV032 | 1 | 2 | 2018-12-18 12:47:00.000 | 5 | 4 | 5 | 5 | 5 | 6 | 1 | 1 | 0 | 0 | 0 |
| 492 | HRV032 | 1 | 2 | 2018-12-18 12:54:50.792 | 5 | 5 | 6 | 5 | 5 | 4 | 3 | 1 | 0 | 0 | 1 |
| 502 | HRV032 | 2 | 5 | 2018-12-19 16:08:00.000 | 3 | 4 | 3 | 3 | 2 | 2 | 3 | 0 | 0 | 0 | 0 |
| 503 | HRV032 | 2 | 5 | 2018-12-19 16:09:17.886 | 5 | 5 | 5 | 6 | 3 | 5 | 1 | 1 | 0 | 0 | 0 |
| 739 | HRV046 | 3 | 5 | 2019-01-24 16:01:42.284 | 2 | 2 | 3 | 2 | 2 | 2 | 1 | 1 | 0 | 0 | 1 |
| 740 | HRV046 | 3 | 5 | 2019-01-24 17:40:00.000 | 4 | 3 | 4 | 2 | 2 | 4 | 1 | 1 | 1 | 0 | 1 |
| 773 | HRV048 | 3 | 3 | 2019-01-31 13:40:00.000 | 4 | 2 | 3 | 3 | 4 | 5 | 1 | 0 | 0 | 0 | 0 |
| 774 | HRV048 | 3 | 3 | 2019-01-31 13:51:13.250 | 7 | 7 | 7 | 3 | 2 | 2 | 6 | 0 | 0 | 0 | 1 |
| 1085 | HRV068 | 2 | 2 | 2019-04-10 12:33:54.719 | 7 | 7 | 7 | 6 | 7 | 7 | 1 | 0 | 0 | 0 | 1 |
| 1086 | HRV068 | 2 | 2 | 2019-04-10 12:43:54.737 | 7 | 7 | 7 | 6 | 7 | 7 | 1 | 0 | 0 | 0 | 0 |
| 1169 | HRV073 | 2 | 3 | 2019-05-08 13:45:50.606 | 5 | 5 | 5 | 4 | 4 | 5 | 2 | 0 | 0 | NA | NA |
| 1171 | HRV073 | 2 | 3 | 2019-05-08 14:03:54.629 | 4 | 3 | 4 | 5 | 5 | 4 | 1 | 0 | 0 | 0 | 0 |
| 1525 | HRV094 | 3 | 3 | 2019-10-17 13:49:35.021 | 5 | 2 | 4 | 4 | 4 | 3 | 2 | 0 | 0 | 0 | 1 |
| 1526 | HRV094 | 3 | 3 | 2019-10-17 13:52:44.679 | 5 | 3 | 4 | 3 | 4 | 3 | 1 | 0 | 0 | 0 | 0 |
| 1527 | HRV094 | 3 | 4 | 2019-10-17 14:51:03.452 | 4 | 4 | 4 | 3 | 3 | 3 | 1 | 0 | 0 | 0 | 0 |
| 1528 | HRV094 | 3 | 4 | 2019-10-17 15:11:23.876 | 4 | 2 | 3 | 3 | 3 | 3 | 1 | 0 | 0 | 0 | 0 |
| 1531 | HRV095 | 1 | 2 | 2019-10-15 12:37:19.946 | 5 | 4 | 5 | 5 | 3 | 5 | 2 | 0 | 0 | 0 | 1 |
| 1532 | HRV095 | 1 | 2 | 2019-10-15 12:37:21.780 | 5 | 4 | 5 | 5 | 3 | 5 | 2 | 0 | 0 | 0 | 1 |
| 1533 | HRV095 | 1 | 2 | 2019-10-15 12:47:56.838 | 5 | 4 | 4 | 5 | 5 | 5 | 1 | 0 | 0 | 0 | 1 |
# excluding double responses
memory2 <- ESM
ESM <- ESM[!duplicated(ESM$idday),]
# removing double responses
cat("Removed further",nrow(memory2)-nrow(ESM),"double responses")
## Removed further 10 double responses
Thus, we removed a total of 20 double responses
# removing double responses
cat("Removed a total of",nrow(memory)-nrow(ESM),"double responses; new number of responses =",nrow(ESM))
## Removed a total of 20 double responses; new number of responses = 1687
Second, we inspect the daily.drugs and the
drugs.what variable (i.e., drugs/medications different than
those usually taken that were assumed on that morning) considering the
inclusion criteria of the study (i.e., not taking medications affecting
the nervous or the cardiovascular system).
Only 14 cases are present in the dataset, of which 7 (4 participants) concerned antiinflammatory medications (Antiinfiammatorio, moment, buscofen), 1 case concerned antibiotics (fosfomicina), 1 case concerned analgesics (paracetamolo), and 3 cases (1 participant) concerned cough syrup (sciroppo per la tosse). None of these cases contrast with our inclusion criteria, but they will be taken into account later, as they can influence some self-reported and physiological variables.
ESM[substr(ESM$ID,1,3)=="HRV" & ESM$within.day==0 & ESM$drugs==1,c("ID","drugs.which")]
Third, we mark the 8 participants that were excluded
for the reasons specified in section 1.3.4. As in that section, such
participants are marked as ‘OUT’. As noted above, participant
HRV009 was even not included in the ESM
dataset because she dropped out before activating the app. Thus, from
the initial 104 participants, the number of excluded participants is 7
(6.73%%), whereas the total number of included participants is 97
(93.27%).
# marking 8 excluded participants as "OUT"
memory <- ESM
ESM$ID <- as.character(ESM$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
ESM[ESM$ID%in%excl,"ID"] <- gsub("HRV","OUT",ESM[ESM$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"
ESM$ID <- as.factor(ESM$ID)
ESM <- ESM[order(ESM$ID,ESM$RunTimestamp),]
# updating number of included participants
cat(nrow(ESM[substr(ESM$ID,1,3)=="HRV",]),"included observations from",
nlevels(as.factor(as.character(ESM[substr(ESM$ID,1,3)=="HRV","ID"]))),"participants out of",
nrow(memory),"total observations from",nrow(memory[!duplicated(memory$ID),]),"invited participants")
## 1607 included observations from 97 participants out of 1687 total observations from 104 invited participants
Finally, although we did not applied inclusion criteria based on further variables than those specified in section 1.3.2, here we consider the response rate and the coherence of the responses as further criteria to flag those cases that might bias the results.
First, we inspect the response rate by using the
respRate function showed at the beginning of section 2.3.
Considering included participants, we can see that all
‘baseline’ surveys (i.e., those administered in the lab) were
responded each day by each participant, whereas a some missing
data were present in ‘ordinary surveys’ (i.e., those
administered with the mobile app). Overall, the mean response rate to
‘ordinary’ surveys was relatively high (90%) with no
participants showing response rates < 60%. Thus, we see no
reasons for excluding or flagging further participants based on
response rate. In contrast, excluded participants showed substantially
lower response rates.
rrate <- respRate(ESM)
##
##
## Included participants: N = 97
## - 1607 responses, from 12 to 18 , mean = 16.57 sd = 1.41 , mean perc = 91.91 % sd = 7.79 %
## - 291 baseline, from 3 to 3 , mean = 3 sd = 0 , mean perc = 100 % sd = 0 %
## - 1316 ordinary, from 9 to 15 , mean = 13.57 sd = 1.41 , mean perc = 90.41 % sd = 9.32 %
##
## Excluded participants: N = 7
## - 80 responses, from 6 to 17 , mean = 11.43 sd = 4.31 , mean perc = 63.29 % sd = 24.11 %
## - 20 baseline, from 2 to 3 , mean = 2.86 sd = 0.38 , mean perc = 95.29 % sd = 12.47 %
## - 60 ordinary, from 4 to 14 , mean = 8.57 sd = 4.12 , mean perc = 57 % sd = 27.45 %
In addition to counting the number of received surveys, we must also consider missing responses within each survey. Indeed, a number of surveys was incomplete due to technical problems with the app. We can notice that missing responses mainly concern items on confounder variables (i.e., activity, alcohol, coffee, smoke and people), which were administered in the last part of each survey and showed from 49 to 59 missing values (about 4%). In contrast, missing data are only about the 2-3% for items measuring stressful events (from 26 to 35), and less than 2% for itmes measuring mood (from 0 to 2).
miss <- sapply(ESM[ESM$within.day!=0,1:which(colnames(ESM)=="people")], function(x) sum(is.na(x))) # counting missing
miss <- data.frame(item=names(miss),nMiss=as.numeric(miss), # missing rate
percMiss=round(100*as.numeric(miss)/nrow(ESM[ESM$within.day!=0,])))
kable(miss) # showing No. and percentage of missing cases
| item | nMiss | percMiss |
|---|---|---|
| ID | 0 | 0 |
| os | 0 | 0 |
| gender | 0 | 0 |
| day.of.week | 0 | 0 |
| within.day | 0 | 0 |
| RunTimestamp | 0 | 0 |
| v1 | 0 | 0 |
| v2 | 13 | 1 |
| v3 | 14 | 1 |
| t1 | 8 | 1 |
| t2 | 13 | 1 |
| t3 | 23 | 2 |
| f1 | 10 | 1 |
| f2 | 14 | 1 |
| f3 | 23 | 2 |
| PE | 24 | 2 |
| PE.int | 26 | 2 |
| NE | 35 | 3 |
| NE.int | 26 | 2 |
| smoke | 49 | 4 |
| coffe | 54 | 4 |
| alcohol | 57 | 4 |
| activity | 57 | 4 |
| people | 59 | 4 |
Here, we remove further 14 surveys (1.0%) due to missing data in almost all items (i.e., those data entries with missing data in the first items), of which only 7 (0.5%) were entered by included participants
# showing 14 cases with missing response to item v3
ESM[is.na(ESM$v3),c(1,which(colnames(ESM)=="v1"):ncol(ESM)),]
# excluding 14 cases with missing response to item v3
ESM <- ESM[!is.na(ESM$v3),]
Second, we flag participants and/or single responses that might be considered as careless respondents/responses.
Following Curran (2016), we consider multiple indicators of potentially careless responses including:
Response times (both item-specific and total)
Response consistency (long string proportions)
Multivariate outlier patterns (Mahalanobis distance)
Response inconsistency (psychometric synonyms)
For each indicator, the numbers between brackets indicates the corresponding number of flagged responses/participants.
First, we consider item-by-item and overall survey response
times in the ‘ordinary’ ESM questionnaires. This
information was encoded in the ESM.RT dataset in section
2.1. Thus, as a first step, we process those data by cleaning and
filtering the same cases cleaned and filtered in the ESM
dataset.
# recoding IDs
# ...................................................
ESM.RT <- participantID_recoding(ESM.RT)
# recoding time
# ...................................................
# converting temporal variables as POSIXct in the ESM dataset
ESM.RT[,c("RunTimestamp","SubmissionTimestamp")] <- # no need to specify format since they already are POSIXct
lapply(ESM.RT[,c("RunTimestamp","SubmissionTimestamp")],function(x) as.POSIXct(x,tz="GMT"))
# correcting wrongly encoded RunTimestamps (adding 1year) and SubmissionTimestamps (RunTimestamp + 61.20 sec)
ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01","SubmissionTimestamp"] <-
ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01","RunTimestamp"] + 61.20
ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] <-
ESM.RT[substr(ESM.RT$RunTimestamp,1,7)=="2018-01",c("RunTimestamp","SubmissionTimestamp")] + 24*60*60*365
# sanity check: min and max matching with the data collection period
c(min(ESM.RT$RunTimestamp),max(ESM.RT$RunTimestamp))
## [1] "2018-11-06 10:18:46.547 GMT" "2019-11-21 15:11:17.835 GMT"
# adding 1 hour to all ESM timestamps
ESM.RT[,c("RunTimestamp","SubmissionTimestamp")] <- ESM.RT[,c("RunTimestamp","SubmissionTimestamp")] + 1*60*60
# adding 1 further hour to ESM timestamps during daylight saving time
ESM.RT$daylight <- "standard"
ESM.RT[ESM.RT$RunTimestamp > as.POSIXct("2019-04-01") & ESM.RT$RunTimestamp < as.POSIXct("2019-10-27"),"daylight"] <- "daylight"
ESM.RT[ESM.RT$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] <-
ESM.RT[ESM.RT$daylight=="daylight",c("RunTimestamp","SubmissionTimestamp")] + 1*60*60
# creating day.of.week
ESM.RT$day.of.week <- as.POSIXlt(ESM.RT$RunTimestamp)$wday - 1
# creating within.day
ESM.RT <- ddply(ESM.RT,c("ID","day.of.week"),transform,within.day=seq_along(day.of.week))
ESM.RT <- within.day.adjust(ESM.RT)
## Creating/Adjusting within.day variable...
## Printing summary of cases:
## 1 2 3 4 5
## 275 279 287 267 284
# recoding other variables
# ....................................................
# ID as factor
ESM.RT$ID <- as.factor(ESM.RT$ID)
# Sorting and renaming columns
ESM.RT <- ESM.RT[,c("ID","day.of.week","within.day","RunTimestamp","SubmissionTimestamp","RT.tot",
"v1.male.bene","v2.soddisfatto.insoddisfatto","v3.positivo.negativo", # hedonic tone
"t1.rilassato.teso","t2.agitato.calmo","t3.nervoso.tranquillo", # calmness
"e1.stanco.sveglio","e2.pieno.privodenergia","e3.affaticato.fresco", # energetic arousal
"PE","intensity.PE","NE","intensity.PE", # emotional events
"smoke","coffe","alcohol","Activity","People.rec")] # confounders
colnames(ESM.RT)[7:ncol(ESM.RT)] <- c("v1","v2","v3","t1","t2","t3","f1","f2","f3", # renaming columns
"PE","PE.int","NE","NE.int","smoke","coffe","alcohol","activity","people")
# filtering double responses
# ..................................................
# by RunTimestamp
ESM.RT$idday <- paste(ESM.RT$ID,ESM.RT$RunTimestamp,sep="_")
cat(nrow(ESM.RT[duplicated(ESM.RT$idday),]),"cases of double responses (same ID and RunTimestamp)") # 10 (= ESM)
## 10 cases of double responses (same ID and RunTimestamp)
ESM.RT <- ESM.RT[!(ESM.RT$idday=="HRV007_2018-11-13 16:18:32.769" & is.na(ESM.RT$activity)),] # excl 2nd resp for HRV007 & HRV073
ESM.RT <- ESM.RT[!(ESM.RT$idday=="HRV073_2019-05-08 13:42:51.806" & is.na(ESM.RT$activity)),]
ESM.RT <- ESM.RT[!duplicated(ESM.RT$idday),] # excluding all remaining double responses
# by day.of.week and within.day
ESM.RT$idday <- paste(ESM.RT$ID,ESM.RT$day.of.week,ESM.RT$within.day,sep="_")
cat(nrow(ESM.RT[duplicated(ESM.RT$idday),]),"cases of double responses (same ID and RunTimestamp)") # 6 (4 less than ESM (?))
## 6 cases of double responses (same ID and RunTimestamp)
ESM.RT <- ESM.RT[!duplicated(ESM.RT$idday),] # excluding all double responses
# marking further excluded
# ..................................................
# marking 8 excluded participants as "OUT"
ESM.RT$ID <- as.character(ESM.RT$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
ESM.RT[ESM.RT$ID%in%excl,"ID"] <- gsub("HRV","OUT",ESM.RT[ESM.RT$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"
ESM.RT$ID <- as.factor(ESM.RT$ID)
ESM.RT <- ESM.RT[order(ESM.RT$ID,ESM.RT$RunTimestamp),]
# excluding cases with missing response to most items
# .................................................
ESM.RT <- ESM.RT[!is.na(ESM.RT$v3),]
# sanity checks
# .................................................
# same number of participants?
nlevels(ESM.RT$ID) == nlevels(ESM$ID)
## [1] TRUE
# same number of included participants?
nlevels(as.factor(as.character(ESM.RT[substr(ESM.RT$ID,1,3)=="HRV","ID"]))) ==
nlevels(as.factor(as.character(ESM[substr(ESM$ID,1,3)=="HRV","ID"])))
## [1] TRUE
# same number of observations?
nrow(ESM.RT) == nrow(ESM[ESM$within.day!=0,])
## [1] TRUE
Now we can inspect the item-by-item response times
(seconds) in the ESM.RT dataset, computed as the difference
between the timestamp of each response and that of the previous one
(note: the first response of each questionnaire does not have an
associated response time).
We can see that responding to a single item required about 2 seconds in most cases (median response time = 2.69 sec), regardless of the type of items. A small number of items were responded in more than 10 seconds (4.3%) with a very few cases requiring more than 50 seconds (N = 41, 0.18%). Likewise, a small number of items were responded in less than 1 second (2.4%) with a very few cases requiring less than 0.65 seconds (N = 109, 0.48%).
# item-by-item long dataset
rt <- melt(ESM.RT[,c("ID","RunTimestamp",colnames(ESM.RT)[which(colnames(ESM.RT)=="v1"):which(colnames(ESM.RT)=="people")])],
id=c("ID","RunTimestamp"))
rt <- rt[!is.na(rt$value),] # removing missing values
rt$items <- "mood" # categorizing variables based on the item contents
rt[rt$variable%in%c("PE","PE.int","NE","NE.int"),"items"] <- "event"
rt[rt$variable%in%c("smoke","coffe","alcohol","activity","people"),"items"] <- "conf"
# summarizing response times
summary(rt$value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.939 2.685 4.134 4.104 4378.792
# plotting response times
grid.arrange(ggplot(rt,aes(value))+geom_histogram(bins=300)+ggtitle("Response times (sec)"),
ggplot(rt,aes(value))+geom_histogram(bins=300)+xlim(0,100)+ggtitle("zooming 0 - 100 sec"),
ggplot(rt,aes(value))+geom_histogram(bins=300)+xlim(0,10)+ggtitle("zooming 0 - 10 sec"),
ggplot(rt[rt$items=="mood",],aes(value))+geom_histogram(bins=300,alpha=0.5)+xlim(0,10) + ggtitle("Mood items"),
ggplot(rt[rt$items=="event",],aes(value))+geom_histogram(bins=300,alpha=0.5)+xlim(0,10) + ggtitle("Event items"),
ggplot(rt[rt$items=="conf",],aes(value))+geom_histogram(bins=300,alpha=0.5)+xlim(0,10) + ggtitle("Confounders items"),
nrow=2)
# No. of response times higher than X
for(value in c(10,20,25,50,100,500,1000,2000,4000)){ cat("\n-",nrow(rt[rt$value>value,]),"RTs >",value,
"s (",round(100*nrow(rt[rt$value>value,])/nrow(rt),2),"% )") }
##
## - 989 RTs > 10 s ( 4.32 % )
## - 250 RTs > 20 s ( 1.09 % )
## - 156 RTs > 25 s ( 0.68 % )
## - 41 RTs > 50 s ( 0.18 % )
## - 18 RTs > 100 s ( 0.08 % )
## - 2 RTs > 500 s ( 0.01 % )
## - 2 RTs > 1000 s ( 0.01 % )
## - 1 RTs > 2000 s ( 0 % )
## - 1 RTs > 4000 s ( 0 % )
# No. of response times lower than X
for(value in c(2,1.5,1,0.65,0.5,0.25,0.1)){ cat("\n-",nrow(rt[rt$value<value,]),"RTs <",value,
"s (",round(100*nrow(rt[rt$value<value,])/nrow(rt),2),"% )") }
##
## - 6241 RTs < 2 s ( 27.28 % )
## - 2561 RTs < 1.5 s ( 11.2 % )
## - 538 RTs < 1 s ( 2.35 % )
## - 109 RTs < 0.65 s ( 0.48 % )
## - 38 RTs < 0.5 s ( 0.17 % )
## - 12 RTs < 0.25 s ( 0.05 % )
## - 12 RTs < 0.1 s ( 0.05 % )
Here, we flag all cases with at least one item
responded in more than 50 seconds with
iRT.slow = 1 and all those responded in less than
650 ms with iRT.fast = 1. While the first
threshold is essentially based on the observed RT distribution, the
second criterion was also recommended by Geeraerts
(2020). We also compute the number of items meeting each condition,
for each survey.
We can see that most cases have only one slow-response-time item (N =
26) with only 7 surveys showing 2 or more items that took more than 50
sec. In contrast, there is a similar proportion of surveys with either 1
(N = 39) or 2 items (N = 35) answered with less than 650 ms. In terms of
careless participants, we flag slow and fast
respondents considering participants with 3+ slow responses (N = 2) and
those with 3+ fast responses (N = 8 + one excluded participant).
Particular attention should be posed on participant HRV099,
showing 11 out of 14 responses with fast response times.
ESM.RT$idday <- as.factor(paste(ESM.RT$ID,ESM.RT$RunTimestamp,sep="_"))
rt$idday <- as.factor(paste(rt$ID,rt$RunTimestamp,sep="_"))
ESM.RT$iRT.slow <- ESM.RT$NiRT.slow <- ESM.RT$iRT.fast <- ESM.RT$NiRT.fast <- ESM.RT$slowResp <- ESM.RT$fastResp <- 0
for(idday in levels(ESM.RT$idday)){
if(any(rt[rt$idday==idday,"value"]>50)){ ESM.RT[ESM.RT$idday==idday,"iRT.slow"] <- 1
ESM.RT[ESM.RT$idday==idday,"NiRT.slow"] <- nrow(rt[rt$idday==idday & !is.na(rt$value) & rt$value>50,]) }
if(any(rt[rt$idday==idday,"value"]<0.65)){ ESM.RT[ESM.RT$idday==idday,"iRT.fast"] <- 1
ESM.RT[ESM.RT$idday==idday,"NiRT.fast"] <- nrow(rt[rt$idday==idday & !is.na(rt$value) & rt$value<0.65,]) }}
# slow response times (N = 33)
summary(as.factor(ESM.RT$NiRT.slow)) # summary of number of slow response times
## 0 1 2 3
## 1329 26 6 1
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$iRT.slow==1,"ID"])))) # summary of slow respondents
## HRV001 HRV003 HRV013 HRV014 HRV042 HRV050 HRV051 HRV054 HRV056 HRV060 HRV063
## 1 1 1 1 1 1 3 1 2 1 1
## HRV067 HRV070 HRV073 HRV078 HRV079 HRV080 HRV085 HRV088 HRV089 HRV092 HRV094
## 2 1 1 1 1 1 1 1 1 1 3
## HRV099 HRV104 OUT015
## 2 2 1
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"slowResp"] <- 1 # flagging slow respondents (3+ responses with 1+ items > 50 sec)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"slowResp"])) # No. of slow respondents (N = 2)
## 0 1
## 102 2
# fast response times (N = 74)
summary(as.factor(ESM.RT$NiRT.fast)) # summary of number of fast response times
## 0 1 2
## 1288 39 35
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$iRT.fast==1,"ID"])))) # summary of fast respondents
## HRV006 HRV008 HRV013 HRV014 HRV020 HRV022 HRV024 HRV039 HRV041 HRV044 HRV049
## 1 4 3 2 1 6 2 1 1 1 1
## HRV050 HRV056 HRV061 HRV065 HRV072 HRV073 HRV082 HRV087 HRV088 HRV089 HRV093
## 1 1 1 4 1 2 4 2 1 1 3
## HRV098 HRV099 HRV100 HRV102 HRV103 OUT015 OUT046
## 2 11 5 2 2 1 7
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"fastResp"] <- 1 # flagging fast respondents (i.e., 3+ responses with 1+ items > 50 sec)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"fastResp"])) # No. of fast respondents (N = 8 + 1 excluded)
## 0 1
## 95 9
Second, the same procedures used with item response times are applied
considering the total response time of each survey
RT.tot, computed in section 2.1 as the seconds between the
first response and the SubmissionTimestamp of each survey.
Note that here we do not consider those cases with missing response to
the last items (i.e., from PE on out).
We can see that responding to the whole questionnaire
required about 1 min in most cases (median = 61.2 sec). A small
number of questionnaires were responded in more than 2 minutes (7.3%)
with a very few cases requiring more than 3 min (N = 29, 2.13%, flagged
as tRT.slow = 1). Likewise, a small number of
questionnaires were responded in less than 40 second (8.2%) with a very
few cases requiring less than 35 seconds (N = 27, 2%, flagged as
tRT.fast = 1). We also flag careless
participants as those with 3+ responses that took more than 5
min (N = 1 flagged as slowResp.tot = 1) and 3+ responses
that took less than 35 seconds (N = 3 flagged as
fastResp.tot = 1).
# summarizing total response times (in both seconds and minutes)
ESM.RT$RT.tot.min <- ESM.RT$RT.tot/60
summary(ESM.RT$RT.tot); summary(ESM.RT$RT.tot.min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.87 48.09 61.20 75.17 81.59 4412.12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4479 0.8016 1.0199 1.2529 1.3598 73.5354
# plotting response times
grid.arrange(ggplot(ESM.RT,aes(RT.tot.min))+geom_histogram(bins=300)+ggtitle("Total response times (min)"),
ggplot(ESM.RT,aes(RT.tot.min))+geom_histogram(bins=300)+xlim(0,10)+ggtitle("zooming 0 - 10 min"),
ggplot(ESM.RT,aes(RT.tot.min))+geom_histogram(bins=300)+xlim(0,5)+ggtitle("zooming 0 - 5 min"),
nrow=1)
# No. of total response times higher than X minutes
for(value in c(2,3,5,10,20,50)){ cat("\n-",nrow(ESM.RT[ESM.RT$RT.tot.min>value,]),"total RTs >",value,
"min (",round(100*nrow(ESM.RT[ESM.RT$RT.tot.min>value,])/
nrow(ESM.RT),2),"% )") }
##
## - 99 total RTs > 2 min ( 7.27 % )
## - 29 total RTs > 3 min ( 2.13 % )
## - 9 total RTs > 5 min ( 0.66 % )
## - 2 total RTs > 10 min ( 0.15 % )
## - 2 total RTs > 20 min ( 0.15 % )
## - 1 total RTs > 50 min ( 0.07 % )
# No. of total response times lower than X seconds
for(value in c(40,35,30,20)){ cat("\n-",nrow(ESM.RT[!is.na(ESM.RT$PE) & ESM.RT$RT.tot<value,]),"total RTs <",value,
"sec (",round(100*nrow(ESM.RT[!is.na(ESM.RT$PE) &
ESM.RT$RT.tot<value,])/
nrow(ESM.RT[!is.na(ESM.RT$PE),]),2),"% )") }
##
## - 111 total RTs < 40 sec ( 8.21 % )
## - 27 total RTs < 35 sec ( 2 % )
## - 1 total RTs < 30 sec ( 0.07 % )
## - 0 total RTs < 20 sec ( 0 % )
# flagging slow > 5 min (N = 29) and fast total response times < 35 sec (N = 27)
ESM.RT$tRT.slow <- ESM.RT$tRT.fast <- ESM.RT$slowResp.tot <- ESM.RT$fastResp.tot <- 0
ESM.RT[ESM.RT$RT.tot.min > 3,"tRT.slow"] <- 1
ESM.RT[!is.na(ESM.RT$PE) & ESM.RT$RT.tot < 35,"tRT.fast"] <- 1
# flagging slow and fast respondents
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$tRT.slow==1,"ID"])))) # summary of slow respondents
## HRV001 HRV003 HRV013 HRV050 HRV051 HRV054 HRV056 HRV063 HRV067 HRV070 HRV073
## 1 1 1 1 2 1 1 1 1 1 1
## HRV075 HRV078 HRV079 HRV080 HRV085 HRV088 HRV089 HRV092 HRV094 HRV099 HRV104
## 1 1 1 1 1 1 1 1 4 2 1
## OUT015 OUT029
## 1 1
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"slowResp.tot"] <- 1 # flagging slow respondents (3+ responses > 5 min)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"slowResp.tot"])) # No. of slow respondents (N = 1)
## 0 1
## 103 1
(s <- summary(as.factor(as.character(ESM.RT[ESM.RT$tRT.fast==1,"ID"])))) # summary of fast respondents
## HRV008 HRV018 HRV019 HRV022 HRV036 HRV043 HRV049 HRV061 HRV063 HRV065 HRV071
## 2 1 5 1 1 1 1 1 1 2 1
## HRV084 HRV087 HRV099 OUT046
## 2 1 3 4
ESM.RT[ESM.RT$ID %in% names(s[s >= 3]),"fastResp.tot"] <- 1 # flagging fast respondents (3+ responses < 35 sec)
summary(as.factor(ESM.RT[!duplicated(ESM.RT$ID),"fastResp.tot"])) # No. of slow respondents (N = 3)
## 0 1
## 101 3
Third, we examine those cases with extreme long string of
identical responses to mood items, that is extremely consistent
responses with long strings of identical value in consecutive items
using the same response scale (see Curran (2016)).
This is done using the careless package. Note that in
section 2.2.3 we recoded mood ratings to have the same polarity. Event
items are not considered because most respondents entered no events
(i.e., 1, 1).
We can see that in most cases (about 71%) the same rating was entered
for up to 3 consecutive items, with the 14% showing strings of
consistent responses repeated more than 4 times (i.e., the rule of thumb
proposed by Curran (2016) of identical string length
> scale range / 2). In a minority of cases (N = 57, 3.4%),
respondents entered the same value for 7+ times, and we mark them as
ls = 1. As done previously, we also flag careless
participants (N = 4) as those with 3+ responses with long
identical strings (N = 18 flagged as lsResp = 1).
# computing the longest string frequency and the average length of identical consecutive responses
mood <- colnames(ESM)[which(colnames(ESM)=="v1"):which(colnames(ESM)=="f3")]
ESM.LS <- cbind(ESM[,c("ID","RunTimestamp","within.day")],longstring(ESM[,mood],avg=TRUE))
# plotting longest identical string and average identical string length
grid.arrange(ggplot(ESM.LS,aes(longstr))+geom_histogram(bins=30)+ggtitle("Longest identical string in each survey"),
ggplot(ESM.LS,aes(avgstr))+geom_histogram(bins=30)+ggtitle("Average identical string length"),
nrow=1)
# No. of longest identical strings longer than X
for(value in 3:6){ cat("\n-",nrow(ESM.LS[ESM.LS$longstr>value,]),"longest identical string >",value,
"times (",round(100*nrow(ESM.LS[ESM.LS$longstr>value,])/nrow(ESM.LS),2),"% )") }
##
## - 477 longest identical string > 3 times ( 28.51 % )
## - 234 longest identical string > 4 times ( 13.99 % )
## - 150 longest identical string > 5 times ( 8.97 % )
## - 57 longest identical string > 6 times ( 3.41 % )
# flagging cases with identical strings > 6 (N = 57)
ESM.LS$lsResp <- ESM.LS$ls <- 0
ESM.LS[ESM.LS$longstr > 6,"ls"] <- 1
summary(as.factor(ESM.LS$ls)) # No. of cases with long string (N = 57)
## 0 1
## 1616 57
# flagging long string respondents (N = 4)
(s <- summary(as.factor(as.character(ESM.LS[ESM.LS$ls == 1,"ID"])))) # No. of long string responses per respondent
## HRV004 HRV005 HRV006 HRV007 HRV008 HRV021 HRV023 HRV030 HRV033 HRV034 HRV036
## 1 4 1 2 2 1 1 1 2 2 2
## HRV038 HRV039 HRV040 HRV045 HRV050 HRV052 HRV054 HRV065 HRV067 HRV068 HRV070
## 3 1 1 1 1 1 1 1 1 1 2
## HRV075 HRV078 HRV081 HRV085 HRV086 HRV088 HRV091 HRV099 HRV100 HRV102 HRV104
## 1 2 1 1 1 2 6 1 2 1 3
## HRV105 OUT015
## 2 1
ESM.LS[ESM.LS$ID %in% names(s[s >= 3]),"lsResp"] <- 1 # flagging slow respondents (i.e., 2+ identical strings > 5)
summary(as.factor(ESM.LS[!duplicated(ESM.LS$ID),"lsResp"]))
## 0 1
## 100 4
Fourth, we examine those cases with unlikely response
patterns by performing a multivariate outlier
analysis based on the Mahalanobis distance. We can see that 72
responses (4.3%) were flagged as multivariate outliers
(md = 1) with a confidence level of 99%, with 6
participants showing 3+ flagged responses and thus marked with
mdResp = 1.
# computing Mahalanobis distance in mood ratings
ESM.LS <- cbind(ESM.LS,mahad(ESM[,mood],flag=TRUE))
# plotting Mahalanobis distance and marking flagged cases (N = 72)
ESM.LS$md <- as.numeric(gsub("FALSE","0",gsub("TRUE","1",ESM.LS$flagged)))
summary(as.factor(ESM.LS$md)) # No. flagged responses (N = 72)
## 0 1
## 1601 72
ggplot(ESM.LS,aes(d_sq,fill=as.factor(md)))+geom_histogram(bins=30)+ggtitle("Mahalanobis distance")
# flagging outlier respondents (N = 6)
(s <- summary(as.factor(as.character(ESM.LS[ESM.LS$md == 1,"ID"])))) # No. of flagged responses per respondent
## HRV002 HRV004 HRV005 HRV008 HRV013 HRV014 HRV020 HRV026 HRV030 HRV031 HRV036
## 1 1 1 1 2 1 1 1 1 2 4
## HRV037 HRV043 HRV048 HRV049 HRV051 HRV053 HRV056 HRV060 HRV061 HRV062 HRV063
## 1 1 1 1 6 1 1 1 2 1 1
## HRV068 HRV069 HRV071 HRV076 HRV079 HRV082 HRV083 HRV084 HRV086 HRV087 HRV088
## 2 1 1 1 1 1 1 1 4 1 1
## HRV089 HRV093 HRV094 HRV096 HRV098 HRV100 HRV102 HRV103 OUT011 OUT015 OUT059
## 1 1 6 3 1 2 1 1 1 1 2
## OUT066
## 4
ESM.LS$mdResp <- 0
ESM.LS[ESM.LS$ID %in% names(s[s >= 3]),"mdResp"] <- 1 # flagging outlier respondents (i.e., 3+ extreme Mahalanobis dist)
summary(as.factor(ESM.LS[!duplicated(ESM.LS$ID),"mdResp"])) # No. flagged respondents (N = 6)
## 0 1
## 98 6
Fifth, we examine those cases with negative correlations
among psychometric synonyms, that is cases where the
intra-response correlation is negative, and specifically less than -0.30
between pairs of items that correlate at r = .60 or stronger.
We can see that 87 responses (5.3%) were flagged as
psychsyn = 1, with 5 participants showing 3+ flagged
responses and thus marked with psychsynResp = 1.
# computing Mahalanobis distance in mood ratings
ESM.LS <- cbind(ESM.LS,psychsyn(ESM[,mood],diag=TRUE,critval=0.60,resample_na=TRUE))
# note that some cases marked with long identical string have SD = 0 so the intra-response correlation cannot be computed
head(cbind(ESM.LS[is.na(ESM.LS$cor),c("ID","longstr","avgstr","ls","lsResp")],ESM[is.na(ESM.LS$cor),mood]))
# flagging and plotting correlations < -.30 (N = 90)
ESM.LS$psychsyn <- 0
ESM.LS[!is.na(ESM.LS$cor) & ESM.LS$cor < (-.30),"psychsyn"] <- 1
summary(as.factor(ESM.LS$psychsyn)) # No flagged responses (N = 90)
## 0 1
## 1582 91
ggplot(ESM.LS,aes(cor,fill=as.factor(psychsyn)))+geom_histogram(bins=100)+ggtitle("Indivual corr. among psychometric synonyms")
# flagging psychsyn respondents (N = 5)
(s <- summary(as.factor(as.character(ESM.LS[ESM.LS$psychsyn == 1,"ID"])))) # No. of flagged responses per respondent
## HRV001 HRV003 HRV004 HRV005 HRV006 HRV008 HRV012 HRV013 HRV014 HRV017 HRV020
## 1 1 1 2 1 1 1 2 1 2 1
## HRV021 HRV024 HRV028 HRV032 HRV035 HRV037 HRV045 HRV048 HRV051 HRV054 HRV056
## 1 1 3 2 2 2 1 2 2 1 1
## HRV060 HRV064 HRV067 HRV069 HRV071 HRV073 HRV074 HRV075 HRV076 HRV078 HRV079
## 1 1 2 3 3 3 2 1 2 2 1
## HRV080 HRV081 HRV082 HRV084 HRV085 HRV086 HRV088 HRV089 HRV090 HRV092 HRV093
## 1 2 1 2 1 2 2 1 1 1 2
## HRV094 HRV095 HRV098 HRV099 HRV100 HRV101 HRV102 HRV103 HRV104 HRV105 OUT015
## 2 1 1 1 2 3 1 1 1 1 1
## OUT027 OUT046 OUT066
## 3 2 2
ESM.LS$psychsynResp <- 0
ESM.LS[ESM.LS$ID %in% names(s[s >= 3]),"psychsynResp"] <- 1 # flagging outlier respondents (i.e., 3+ flagged responses)
summary(as.factor(ESM.LS[!duplicated(ESM.LS$ID),"psychsynResp"])) # No. flagged participants (N = 4 + 1 excluded)
## 0 1
## 98 6
Here, we summarize the number of potentially careless responses and
participants for each criterion, and we compute an aggregated variable
indicating the number of criteria (from 1 to 5) based on which each
response and participant is classified as potentially careless. Then, we
flag all responses classified as potentially careless
based on 1+ criteria (N = 317, of which 293 from included participants)
with cLess1 = 1 and those based on 2+ criteria (N = 57, of
which 51 from included participants) with cLess2 = 1,
respectively. Similarly, we flag all participants
classified as potentially careless based on 1+ criteria (N = 26, of
which 23 included participants) with cLessResp1 = 1, and
those based on 2+ criteria (N = 4, of which 3 included participants)
with cLessResp2 = 1.
# joining careless criteria
ESM.LS$idday <- as.factor(paste(ESM.LS$ID,ESM.LS$RunTimestamp,sep="_"))
ESM.careless <- join(ESM.LS[,c("ID","within.day","idday","ls","md","psychsyn","lsResp","mdResp","psychsynResp")],
ESM.RT[,c("idday","iRT.fast","tRT.fast","iRT.slow","tRT.slow",
"fastResp","fastResp.tot","slowResp","slowResp.tot")],by="idday",type="left")
# computing aggregate careless scores
ESM.careless$cLessResp2 <- ESM.careless$cLessResp1 <- ESM.careless$cLess2 <- ESM.careless$cLess1 <- 0
cr <- c("ls","md","psychsyn","iRT.fast","tRT.fast","iRT.slow","tRT.slow")
ESM.careless$carelessN <- rowSums(ESM.careless[,cr],na.rm=TRUE)
ESM.careless[ESM.careless$carelessN>=1,"cLess1"] <- 1 # careless response if meeting 1+ criteria
ESM.careless[ESM.careless$carelessN>=2,"cLess2"] <- 1 # careless response if meeting 2+ criteria
crResp <- c("lsResp","mdResp","psychsynResp","fastResp","fastResp.tot","slowResp","slowResp.tot")
ESM.careless$cLessRespN <- rowSums(ESM.careless[,crResp],na.rm=TRUE)
ESM.clResp <- ESM.careless[ESM.careless$within.day!=0,]
ESM.clResp <- ESM.clResp[!duplicated(ESM.clResp$ID),]
ESM.clResp[ESM.clResp$cLessRespN>=1,"cLessResp1"] <- 1 # careless respondent if meeting 1+ criteria
ESM.clResp[ESM.clResp$cLessRespN>=2,"cLessResp2"] <- 1 # careless respondent if meeting 1+ criteria
# summarizing criteria of potentially careless responses (N = 51)
summary(as.data.frame(lapply(ESM.careless[,c(cr,"carelessN","cLess1","cLess2")],as.factor))) # all in
## ls md psychsyn iRT.fast tRT.fast iRT.slow tRT.slow
## 0:1616 0:1601 0:1582 0 :1285 0 :1332 0 :1326 0 :1330
## 1: 57 1: 72 1: 91 1 : 74 1 : 27 1 : 33 1 : 29
## NA's: 314 NA's: 314 NA's: 314 NA's: 314
##
##
## carelessN cLess1 cLess2
## 0:1356 0:1356 0:1616
## 1: 260 1: 317 1: 57
## 2: 50
## 3: 5
## 4: 2
summary(as.data.frame(lapply(ESM.careless[substr(ESM.careless$ID,1,3)!="OUT",
c(cr,"carelessN","cLess1","cLess2")],as.factor))) # only included participants
## ls md psychsyn iRT.fast tRT.fast iRT.slow tRT.slow
## 0:1544 0:1536 0:1517 0 :1240 0 :1283 0 :1274 0 :1279
## 1: 56 1: 64 1: 83 1 : 66 1 : 23 1 : 32 1 : 27
## NA's: 294 NA's: 294 NA's: 294 NA's: 294
##
##
## carelessN cLess1 cLess2
## 0:1307 0:1307 0:1549
## 1: 242 1: 293 1: 51
## 2: 46
## 3: 3
## 4: 2
# summarizing criteria of potentially careless respondents (N = 16)
summary(as.data.frame(lapply(ESM.clResp[,c(crResp,"cLessRespN","cLessResp1","cLessResp2")],as.factor))) # all in
## lsResp mdResp psychsynResp fastResp fastResp.tot slowResp slowResp.tot
## 0:100 0:98 0:98 0:95 0:101 0:102 0:103
## 1: 4 1: 6 1: 6 1: 9 1: 3 1: 2 1: 1
##
##
## cLessRespN cLessResp1 cLessResp2
## 0:78 0:78 0:100
## 1:22 1:26 1: 4
## 2: 3
## 3: 1
summary(as.data.frame(lapply(ESM.clResp[substr(ESM.clResp$ID,1,3)!="OUT",
c(crResp,"cLessRespN","cLessResp1","cLessResp2")],as.factor))) # only included
## lsResp mdResp psychsynResp fastResp fastResp.tot slowResp slowResp.tot
## 0:93 0:92 0:92 0:89 0:95 0:95 0:96
## 1: 4 1: 5 1: 5 1: 8 1: 2 1: 2 1: 1
##
##
## cLessRespN cLessResp1 cLessResp2
## 0:74 0:74 0:94
## 1:20 1:23 1: 3
## 2: 2
## 3: 1
# joining careless scores to the ESM dataset
ESM$idday <- as.factor(paste(ESM$ID,ESM$RunTimestamp,sep="_"))
ESM <- join(ESM,ESM.careless[,c("idday","cLess1","cLess2")],by="idday",type="left")
ESM <- join(ESM,ESM.clResp[,c("ID","cLessResp1","cLessResp2")],by="ID",type="left")
Here, we visually inspect all responses marked with
cLess2 = 1. We can note several cases of long string
proportions (e.g., HRV013_2_3) as well as cases of
inconsistent responses (e.g., HRV060_1_0 for items
t2 and t3).
ESM$id <- as.factor(paste(ESM$ID,ESM$day.of.week,ESM$within.day,sep="_"))
plotResp(ESM[ESM$cLess2==1,],"id",mood)
Here, we visually inspect all responses from participants marked with
cLessResp2 = 2. Also in this case, we can note several
cases of long string proportions (e.g., HRV051_2_4) and
especially cases of inconsistent responses (e.g.,
HRV051_1_1 for items f2 and
f3).
plotResp(ESM[ESM$cLessResp2==1,],"id",mood)
Here, we describe the steps implemented to pre-process the pulse interval data recorded with the E4 wristband (Empatica, Milan). Then, pre-processed pulse intervals are read and used for computing HR and HRV metrics, which are subsequently recoded and filtered based on data quality and inclusion criteria.
The following signal pre-processing steps were implemented to derive pulse intervals from the raw blood volume pulse (BVP):
We selected the recording segments between each couple of markers
signaled by participants by focusing on the last 2
minutes of each 3-minute resting interval, and we used
consecutive pulse wave foot points in the BVP signal to
automatically detect pulse intervals. This was done with the Pulse Intervals Detection
App - PIDApp, based on the pracma and the shiny R packages.
Artefacts were removed based on visual
inspection of both the BVP signal and pulse intervals time
series, and recording segments with artefacts in more than the
25% were excluded from the analyses (marked as
out). Specifically, a given pulse intervals was classified
as artefact and removed if meeting three out of five
criteria:
clear noise in the related raw BVP signal
clear rise in the amplitude of the corresponding acceleration signal
excessive deviation (i.e., > 300 ms) from the mean pulse interval length of the surrounding 10 pulse intervals
excessive deviation (i.e., > 10 bpm) from the instantaneous HR values preceding and following the pulse interval
excessive deviation (i.e., > 1 ms) between the RMSSD value computed with and without considering the pulse interval
The resulting pulse interval time series were automatically processed using the freely available ARTiiFACT software (version 2.13) (Kaufmann et al., 2011). Specifically, we spline interpolated pulse intervals identified as artefacts based on the median absolute deviation method.
A final visual inspection was performed based on
the code showed below (not run), and the pre-processed data were saved
in the HRV/FinalProcessing folder.
# Before executing the following code, pulse intervals were pre-processed as described in steps 1-3.
# Each pre-processed segment was named as "subject_day_interval_Number of artefacts".
# Intervals discarded due to too many artefacts were marked with the label "out" in place of the number of artefacts.
# Missing intervals (i.e., not signaled by participants) were marked as "miss"
# ..............................................
# 4.1. DATA VISUALIZATION AND COMPARISON
# ..............................................
# Here, we show the code (not run) used for visually inspecting the pulse intervals time series processed with ARTiiFACT.
# Automatically processed pulse intervals obtained in step 3 were saved in the "HRV/ARTIIFACTcorrected" folder.
# Automatically processed pulse intervals are also compared them with the original (i.e., manually processed) data.
# Manually processed pulse intervals obtained in step 2 were saved in the "HRV/clean_noOUT" folder.
# reading file names
data.path <- "HRV/ARTIIFACTcorrected" # files processed with ARTiiFACT
paths = list.files(data.path, recursive = TRUE,full.names = TRUE, include.dirs = FALSE, pattern = ".txt")
data.path.original <- "HRV/clean_noOUT" # original data
paths.original = list.files(data.path.original, recursive = TRUE,full.names = TRUE, include.dirs = FALSE,pattern=".csv")
# isolating ID x day couples to be plotted
IDs <- strsplit(gsub(".txt","",gsub("_ArtifactCorrectedWithMethod_1.txt","",gsub("/","",gsub(data.path,"",paths)))),"_")
IDdays <- levels(as.factor(paste(rep(levels(as.factor(unlist(IDs)[grep("HRV",unlist(IDs))])),3),
c(rep(1,length(IDs)),rep(2,length(IDs)),rep(3,length(IDs))),sep="_")))
# reading, processing, plotting, and saving pulse intervals
for(IDday in IDdays[260:length(IDdays)]){
paths.cor <- grep(IDday,paths,value=TRUE)
paths.or <- grep(IDday,paths.original,value=TRUE)
# setting plot interface
par(mfrow=c(2,4),mar=c(2,2,3,2))
# dataset of HR and HRV
Summ <- data.frame(interval=NA,RMSSD.cor=NA,RMSSD.or=NA,HR.cor=NA,HR.or=NA)
for(i in 1:length(paths.cor)){
# matching processed and original data
ID <- gsub(".csv","",gsub("_ArtifactCorrectedWithMethod_1.txt","",gsub("/","",gsub(data.path,"",paths.cor[i]))))
new.data <- read.csv(paths.cor[i],header=FALSE)
colnames(new.data)[1] <- "IBI"
original.data <- read.csv(paths.or[i])
colnames(original.data)[1] <- "time"
# adding info to dataset
Summ <- rbind(Summ,data.frame(interval=as.integer(strsplit(ID,"_")[[1]][3]),
RMSSD.cor=round(sqrt(mean(diff(new.data$IBI)^2)),2),
RMSSD.or=round(sqrt(mean(diff(original.data$IBI)^2)),2),
HR.cor=round(mean(60000/new.data$IBI),2),HR.or=round(mean(60000/original.data$IBI),2)))
# plotting corrected over original IBIs
plot(original.data[1:nrow(original.data),"time"],60000/original.data[1:nrow(original.data),"IBI"],
type="o",pch=20,cex=1.5,cex.main=1.2,xlab=NA,ylab=NA,
main=paste(ID,"\nRMSSD =",Summ[i+1,3],"-->",Summ[i+1,2],"- HR =",Summ[i+1,5],"-->",Summ[i+1,4]))
lines(original.data[1:nrow(original.data),"time"],60000/new.data$IBI,
type="o",pch=20,cex=1.5,col="red")
# saving corrected data with time information
write.csv(data.frame(time=original.data$time,IBI=new.data$IBI),gsub("clean_noOUT","SanityCheck",paths.or[i])) }
# plotting RMSSD by interval
plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"RMSSD.or"],main=paste("RMSSD (ms)",IDday),type="o")
lines(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"RMSSD.cor"],type="o",col="red")
# plotting HR by interval
plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"HR.or"],main=paste("HR (bpm)",IDday),type="o")
lines(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"HR.cor"],type="o",col="red")
readline(prompt=cat(IDday,"\nPress [enter]")) }
# Based on this visual inspection, some signal segments were manually re-processed by interpolating artefacted pulse intervals.
# Specifically, we only kept corrections leading to substantial absolute change (i.e., > 1 ms) in the RMSSD of each segment.
# Manually re-processed files were saved in the "HRV/SanityCheck2" folder.
# ..............................................
# 4.2. FINAL VISUAL INSPECTION AND DATA EXPORT
# ..............................................
# As a final step, we conducted a last visual inspection and we saved the final data in the "HRV/FinalProcessing" folder.
# reading file names (i.e., "subject_day_interval_Number of artefacts")
data.path <- "HRV/SanityCheck2" # automatically processed & checked data
paths = list.files(data.path, recursive = TRUE,
full.names = TRUE, include.dirs = FALSE, pattern = ".csv")
# reading files, recoding, plotting, and saving
for(IDday in IDdays[201:length(IDdays)]){
# setting plot interface
par(mfrow=c(2,4),mar=c(2,2,3,2))
# dataset of HR and HRV
Summ <- data.frame(interval=NA,RMSSD=NA,HR=NA)
# reading each interval in a given IDday
paths.IDday <- grep(IDday,paths,value=TRUE)
for(i in 1:length(paths.IDday)){
# changing participants' ID
ID <- strsplit(gsub(" ","",(gsub("/","",gsub(".csv","",(gsub(data.path,"",paths.IDday[i])))))),split = "_")
if(nchar(ID[[1]][1])==5){
ID[[1]][1] <- paste("HRV0",substr(ID[[1]][1],4,5),sep="")
ID <- paste(ID[[1]][1],ID[[1]][2],ID[[1]][3],ID[[1]][4],sep="_")
} else{ ID <- gsub("/","",gsub(".csv","",(gsub(data.path,"",paths.IDday[i])))) }
if(strsplit(ID,"_")[[1]][4]!="out" & strsplit(ID,"_")[[1]][4]!="miss"){
# changing data format
new.data <- read.csv(paths.IDday[i])
if(colnames(new.data)[1]=="X"){ new.data <- new.data[,2:3] }
if(colnames(new.data)[1]=="time"){colnames(new.data)[1]="t"}
# adding info to dataset
Summ <- rbind(Summ,data.frame(interval=as.integer(strsplit(ID,"_")[[1]][3]),
RMSSD=round(sqrt(mean(diff(new.data$IBI)^2)),2),
HR=round(mean(60000/new.data$IBI),2)))
# plotting corrected over original IBIs
plot(new.data$t,60000/new.data$IBI,
type="o",pch=20,cex=1.5,cex.main=1.2,xlab=NA,ylab=NA,
main=paste(ID,"\nRMSSD =",Summ[!is.na(Summ$interval)&Summ$interval==i,2],
", HR =",Summ[!is.na(Summ$interval)&Summ$interval==i,3]))
# saving corrected data with time information
write.csv(new.data,paste("HRV/FinalProcessing/",ID,".csv",sep=""),row.names=FALSE)
} else{ # cases marked as "out" or "miss"
plot(1,main=paste(ID,"\nRMSSD =",Summ[i+1,2],"- HR =",Summ[i+1,3])) }} # empty plot
# plotting RMSSD by interval
plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"RMSSD"],main=paste("RMSSD (ms)",IDday),type="b",pch=20,cex=5)
plot(Summ[2:nrow(Summ),"interval"],Summ[2:nrow(Summ),"HR"],main=paste("HR (bpm)",IDday),type="b",pch=20,cex=5)
readline(prompt=cat(IDday,"\nPress [enter]")) }
Here, we we read the pre-processed pulse intervals time series and we compute the following metrics for each two-minute recording segment:
Mean pulse interval PI, computed as
the mean of the pulse intervals (ms) in a given recording
segment
Mean heart rate HR (bpm), computed
as the mean of 60000/PI
Root mean square of successive differences in N-N
intervals RMSSD (ms), computed as the squared root
of the squared mean successive differences in consecutive pulse
intervals
RMSSD coefficient of variation
RMSSDcv, computed as 100 * RMSSD / PI,
following de Geus et al (2019)’s recommendation to
test and report on how adjusting HRV metrics for chronotropic state
affects primary study outcomes
The readHRV function is used to process pulse intervals
and merge them into the HRV dataset.
readHRV
readHRV <- function(data.path){
# empty data.frame to be filled
HRVdata <- data.frame(ID=NA,day=NA,within.day=NA,PI=NA,HR=NA,RMSSD=NA,RMSSDcv=NA,artefacts=NA,nPI=NA)
paths = list.files(data.path, recursive = TRUE, full.names = TRUE, include.dirs = FALSE)
# data reading
for(path in paths){
ID <- gsub("/","",gsub(data.path,"",gsub(".csv","",path)))
ID <- strsplit(ID,"_")[[1]]
if(ID[4]=="out"|ID[4]=="miss"){ # missing or excluded data
HRVdata <- rbind(HRVdata,
data.frame(ID=ID[1],day=ID[2],within.day=ID[3],
PI=NA,HR=NA,RMSSD=NA,RMSSDcv=NA,
artefacts=ID[4],nPI=NA))
}else{ # complete data
new.data <- read.csv(path)
HRVdata <- rbind(HRVdata,
# computing HRV metrics
data.frame(ID=ID[1],day=ID[2],within.day=ID[3],
PI=mean(new.data$IBI), # mean pulse interval
HR=mean(60000/new.data$IBI), # mean heart rate
RMSSD=sqrt(mean(diff(new.data$IBI)^2)), # RMSSD
RMSSDcv= 100 * sqrt(mean(diff(new.data$IBI)^2)) / mean(new.data$IBI), # RMSSDcv
artefacts=ID[4],
nPI=nrow(new.data)))}}
# adjusting data
HRVdata <- HRVdata[2:nrow(HRVdata),] # removing first empty row
HRVdata[,c("ID","day","within.day")] <- lapply(HRVdata[,c("ID","day","within.day")],as.factor)
return(HRVdata)}
(HRV <- readHRV(data.path="HRV/FinalProcessing"))[1:3,] # reading data and showing first three rows
HRV data were already cleaned and filtered in the signal pre-processing step (see section 3.1). Here, we report the details on data quality and we consider some data quality indicators as further criteria to flag those cases that might bias the results. The following packages and functions are used to optimize data filtering:
library(plyr) # loading required packages
complRate
complRate <- function(data,what=c("compliance","inclusion")){
# scheduled No. of surveys (1 baseline + 5 ordinary X 3 days)
tot <- (1 + 5) * 3
amb <- 5 * 3
lab <- 1 * 3
# subsetting data
data.ok <- data[!(data$artefacts%in%c("out","miss")),] # clean segments
data.nmiss <- data[data$artefacts!="miss",] # nonmissing segments
data.miss <- data[data$artefacts=="miss",] # missing segments
data.out <- data[data$artefacts=="out",] # excluded segments
# creating dataset of compliance and inclusion rates
cr <- data.frame(ID=levels(data$ID),tot.nmiss=as.numeric(table(data.nmiss$ID))) # No. of nonmissing segments per participant
cr$lab.nmiss <- as.numeric(table(data.nmiss[data.nmiss$within.day==6,"ID"])) # No. nonmissing segments in lab
cr$amb.nmiss <- as.numeric(table(data.nmiss[data.nmiss$within.day!=6,"ID"])) # No. nonmissing segments outside lab
# compliance: nonmissing and missing over scheduled
if("compliance" %in% what){
# computing compliance rate
cr$tot.CR <- round(100*cr$tot.nmiss/tot,1) # compliance rate
cr$lab.CR <- round(100*cr$lab.nmiss/lab,1) # compliance rate in lab
cr$amb.CR <- round(100*cr$amb.nmiss/amb,1) # compliance rate outside lab
cat("\n\nNonmissing segments:",
"\n -",sum(cr$tot.nmiss),"nonmissing segments (",
round(100*sum(cr$tot.nmiss)/(tot*nlevels(data$ID)),1),"% of tot scheduled ), from",min(cr$tot.nmiss),"to",max(cr$tot.nmiss),
", mean =",round(mean(cr$tot.nmiss),1),"sd =",round(sd(cr$tot.nmiss),1),
", mean perc =",round(mean(cr$tot.CR),1),"% sd =",round(sd(cr$tot.CR),1),"%",
"\n -",sum(cr$amb.nmiss),"from outside lab, from",min(cr$amb.nmiss),"to",max(cr$amb.nmiss),
", mean =",round(mean(cr$amb.nmiss),1),"sd =",round(sd(cr$amb.nmiss),1),
", mean perc =",round(mean(cr$amb.CR),1),"% sd =",round(sd(cr$amb.CR),1),"%",
"\n -",sum(cr$lab.nmiss),"from lab, from",min(cr$lab.nmiss),"to",max(cr$lab.nmiss),
", mean =",round(mean(cr$lab.nmiss),1),"sd =",round(sd(cr$lab.nmiss),1),
", mean perc =",round(mean(cr$lab.CR),1),"% sd =",round(sd(cr$lab.CR),1),"%")
# computing No. of missing segments and missing rate
cr$tot.miss <- as.numeric(table(data.miss$ID)) # No. of missing segments per participant
cr$tot.MR <- round(100*cr$tot.miss/tot,1) # missing rate
cr$lab.miss <- as.numeric(table(data.miss[data.miss$within.day==6,"ID"])) # No. missing segments in lab
cr$lab.MR <- round(100*cr$lab.miss/lab,1) # missing rate in lab
cr$amb.miss <- as.numeric(table(data.miss[data.miss$within.day!=6,"ID"])) # No. missing segments outside lab
cr$amb.MR <- round(100*cr$amb.miss/amb,1) # missing rate outside lab
cat("\n\nMissing segments:",
"\n -",sum(cr$tot.miss),"missing segments (",
round(100*sum(cr$tot.miss)/(tot*nlevels(data$ID)),1),"% of tot scheduled ), from",min(cr$tot.miss),"to",max(cr$tot.miss),
", mean =",round(mean(cr$tot.miss),1),"sd =",round(sd(cr$tot.miss),1),
", mean perc =",round(mean(cr$tot.MR),1),"% sd =",round(sd(cr$tot.MR),1),"%",
"\n -",sum(cr$amb.miss),"from outside lab, from",min(cr$amb.miss),"to",max(cr$amb.miss),
", mean =",round(mean(cr$amb.miss),1),"sd =",round(sd(cr$amb.miss),1),
", mean perc =",round(mean(cr$amb.MR),1),"% sd =",round(sd(cr$amb.MR),1),"%",
"\n -",sum(cr$lab.miss),"from lab, from",min(cr$lab.miss),"to",max(cr$lab.miss),
", mean =",round(mean(cr$lab.miss),1),"sd =",round(sd(cr$lab.miss),1),
", mean perc =",round(mean(cr$lab.MR),1),"% sd =",round(sd(cr$lab.MR),1),"%") }
# inclusion: excluded and included over nonmissing
if("inclusion" %in% what){
# computing No. of excluded segments and exclusion rate
cr$tot.out <- as.numeric(table(data.out$ID)) # No. of excluded segments per participant
cr$tot.ER <- round(100*cr$tot.out/cr$tot.nmiss,1) # exclusion rate
cr$lab.out <- as.numeric(table(data.out[data.out$within.day==6,"ID"])) # No. excluded segments in lab
cr$lab.ER <- round(100*cr$lab.out/cr$lab.nmiss,1) # exclusion rate in lab
cr$amb.out <- as.numeric(table(data.out[data.out$within.day!=6,"ID"])) # No. excluded segments outside lab
cr$amb.ER <- round(100*cr$amb.out/cr$amb.nmiss,1) # exclusion rate outside lab
cat("\n\nExcluded segments:",
"\n -",sum(cr$tot.out),"excluded segments (",
round(100*sum(cr$tot.out)/sum(cr$tot.nmiss),1),"% of nonmissing segments), from",min(cr$tot.out),"to",max(cr$tot.out),
", mean =",round(mean(cr$tot.out),1),"sd =",round(sd(cr$tot.out),1),
", mean perc =",round(mean(cr$tot.ER),1),"% sd =",round(sd(cr$tot.ER),1),"%",
"\n -",sum(cr$amb.out),"from outside lab, from",min(cr$amb.out),"to",max(cr$amb.out),
", mean =",round(mean(cr$amb.out),1),"sd =",round(sd(cr$amb.out),1),
", mean perc =",round(mean(cr$amb.ER)),"% sd =",round(sd(cr$amb.ER)),"%",
"\n -",sum(cr$lab.out),"from lab, from",min(cr$lab.out),"to",max(cr$lab.out),
", mean =",round(mean(cr$lab.out),1),"sd =",round(sd(cr$lab.out),1),
", mean perc =",round(mean(cr$lab.ER),1),"% sd =",round(sd(cr$lab.ER),1),"%")
# computing No. of included segments and inclusion rate
cr$tot.ok <- as.numeric(table(data.ok$ID)) # No. of included segments per participant
cr$tot.IR <- round(100*cr$tot.ok/cr$tot.nmiss,1) # inclusion rate
cr$lab.ok <- as.numeric(table(data.ok[data.ok$within.day==6,"ID"])) # No. included segments in lab
cr$lab.IR <- round(100*cr$lab.ok/cr$lab.nmiss,1) # inclusion rate in lab
cr$amb.ok <- as.numeric(table(data.ok[data.ok$within.day!=6,"ID"])) # No. included segments outside lab
cr$amb.IR <- round(100*cr$amb.ok/cr$amb.nmiss,1) # inclusion rate outside lab
cat("\n\nIncluded segments:",
"\n -",sum(cr$tot.ok),"excluded segments (",
round(100*sum(cr$tot.ok)/sum(cr$tot.nmiss),1),"% of nonmissing segments), from",min(cr$tot.ok),"to",max(cr$tot.ok),
", mean =",round(mean(cr$tot.ok),1),"sd =",round(sd(cr$tot.ok),1),
", mean perc =",round(mean(cr$tot.IR),1),"% sd =",round(sd(cr$tot.IR),1),"%",
"\n -",sum(cr$amb.ok),"from outside lab, from",min(cr$amb.ok),"to",max(cr$amb.ok),
", mean =",round(mean(cr$amb.ok),1),"sd =",round(sd(cr$amb.ok),1),
", mean perc =",round(mean(cr$amb.IR),1),"% sd =",round(sd(cr$amb.IR),1),"%",
"\n -",sum(cr$lab.ok),"from lab, from",min(cr$lab.ok),"to",max(cr$lab.ok),
", mean =",round(mean(cr$lab.ok),1),"sd =",round(sd(cr$lab.ok),1),
", mean perc =",round(mean(cr$lab.IR),1),"% sd =",round(sd(cr$lab.IR),1),"%") }
return(cr)}
Computes a data frame of compliance rate values and prints summary information on compliance and inclusion rate.
plotPI <- function(data.path,IDs){
paths = list.files(data.path, recursive = TRUE, full.names = TRUE, include.dirs = FALSE)
for(ID in IDs){ pathIDs <- paths[substr(paths,21,26)==ID]
par(mfrow=c(2,4))
for(pathID in pathIDs){ id <- strsplit(gsub("/","",gsub(pathID,"",gsub(".csv","",pathID))),split="_")[[1]][4]
if(id!="out" & id!="miss"){ pi <- read.csv(pathID)
plot(pi$t,60000/pi$IBI,type="o",pch=20,cex=.5,lwd=.5,cex.main=1.2,xlab="time (sec)",ylab="HR (bpm)",
main=gsub(data.path,"",pathID),ylim=c(40,120))}}}}
Reads and plots the pulse intervals time series from the specified subsample of participants.
We start by inspecting the original number of respondents and
observations in the HRV dataset, which corresponds to the
total number of scheduled recordings (i.e., No. participants x 3 days x
6 recordings), that is 1,746. We can see that the dataset includes a
total of 97 participants, eight less than the 105 participants that were
actually invited to participate to the daily protocol. This is because
the eight excluded participants mentioned in sections
1.3.5 and 2.3.3 were not pre-processed or their pre-processing was
interrupted after noticing too low signal quality.
cat("Total No. of scheduled HRV recordings =",nrow(HRV),"from",nlevels(HRV$ID),"participants")
## Total No. of scheduled HRV recordings = 1746 from 97 participants
First, we inspect the data for cases of double recording segments,
that is segments with the same ID, day, and
within.day value. We can see that no double
recordings are included in the HRV dataset.
cat(nrow(HRV[duplicated(paste(HRV$ID,HRV$day,HRV$within.day)),]),"double recordings")
## 0 double recordings
Second, we inspect the compliance ratings and the data quality as further criteria to flag those cases that might bias the results.
First, we inspect the compliance rate, that is the number of
recordings that were not marked as “miss” in the file name over the
total number of scheduled recordings, by using the
complRate function showed above.
We can see that all lab-based segments were recorded each day for each participant (compliance = 100%), whereas a some missing data were present in ambulatory segments (i.e., those recorded outside the lab). Overall, the mean compliance rate with ambulatory recordings was relatively high (92%) with no participants showing response rates < 60%. Thus, we see no reasons for excluding or flagging further participants based on compliance rate.
crate <- complRate(HRV,what="compliance")
##
##
## Nonmissing segments:
## - 1629 nonmissing segments ( 93.3 % of tot scheduled ), from 13 to 18 , mean = 16.8 sd = 1.4 , mean perc = 93.3 % sd = 7.5 %
## - 1338 from outside lab, from 10 to 15 , mean = 13.8 sd = 1.4 , mean perc = 92 % sd = 9 %
## - 291 from lab, from 3 to 3 , mean = 3 sd = 0 , mean perc = 100 % sd = 0 %
##
## Missing segments:
## - 117 missing segments ( 6.7 % of tot scheduled ), from 0 to 5 , mean = 1.2 sd = 1.4 , mean perc = 6.7 % sd = 7.5 %
## - 117 from outside lab, from 0 to 5 , mean = 1.2 sd = 1.4 , mean perc = 8 % sd = 9 %
## - 0 from lab, from 0 to 0 , mean = 0 sd = 0 , mean perc = 0 % sd = 0 %
Second, we inspect the rate of pulse intervals identified as artefacts in the included segments. Note that recording segments with artefacts in more than the 25% of the pulse intervals were excluded and marked as “out”.
We can note that a few included segments (N = 9, 0.7%) were still
presenting artefacts in more than the 25%, and thus we exclude them.
Without these cases, the average No. of removed artefacts per segment is
7.34 (5.2%). Here, we consider a more restrictive criterion by
flagging segments with artefacts in more than the 10%
of the pulse intervals (N = 234, with art10 = 1, 17%).
# separating "miss" and "out" from the No. of artefacts
HRV$out <- HRV$miss <- 0
HRV[HRV$artefacts=="miss","miss"] <- 1
HRV[HRV$artefacts=="out","out"] <- 1
HRV$nArtefacts <- HRV$artefacts
HRV[HRV$artefacts %in% c("miss","out"),"nArtefacts"] <- NA
HRV$nArtefacts <- as.integer(HRV$nArtefacts) # No. of artefacts
HRV$pArtefacts <- 100*HRV$nArtefacts/HRV$nPI # artefact rate
# checking for included cases with artefacts in more than the 25%
cat("Excluding further",nrow(HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,]),"segments with 25%+ artefacts")
## Excluding further 9 segments with 25%+ artefacts
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,"out"] <- 1 # excluding these cases
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,"artefacts"] <- "out"
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,c("PI","HR","RMSSD","RMSSDcv","nPI","nArtefacts","pArtefacts")] <- NA
cat(nrow(HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>25,]),"remaining segments with 25%+ artefacts")
## 0 remaining segments with 25%+ artefacts
# summarizing and plotting No. of artefacts per segment
cat(sum(HRV$nArtefacts,na.rm=TRUE),"artefacts removed from included segments (",
round(100*sum(HRV$nArtefacts,na.rm=TRUE)/sum(HRV$nPI,na.rm=TRUE),1),"% of pulse intervals )",
"\n- No. of artefacts per recording ranging from",min(HRV$nArtefacts,na.rm=TRUE),"to",max(HRV$nArtefacts,na.rm=TRUE),
", mean =",round(mean(HRV$nArtefacts,na.rm=TRUE),2),"sd =",round(sd(HRV$nArtefacts,na.rm=TRUE),2),
"\n- Perc. artefacts per recording ranging from",
min(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),"% to",max(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),"%, mean =",
round(mean(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),1),"% sd =",round(sd(100*HRV$nArtefacts/HRV$nPI,na.rm=TRUE),1),"%")
## 9906 artefacts removed from included segments ( 4.9 % of pulse intervals )
## - No. of artefacts per recording ranging from 0 to 29 , mean = 7.34 sd = 7.61
## - Perc. artefacts per recording ranging from 0 % to 25 %, mean = 5.2 % sd = 5.5 %
par(mfrow=c(1,2)); hist(HRV$nArtefacts,main="No. removed artefacts"); hist(100*HRV$nArtefacts/HRV$nPI,main="% removed artefacts")
# flagging cases with artefacts in more than the 10%
HRV[!is.na(HRV$pArtefacts),"art10"] <- 0
HRV[!is.na(HRV$pArtefacts) & HRV$pArtefacts>10,"art10"] <- 1
summary(as.factor(HRV$art10))
## 0 1 NA's
## 1115 234 397
Third, we inspect the quality of the recorded signal in terms of
number of segment excluded due to excessive artefacts
(i.e., segments with artefacts in more than the 25% of the pulse
intervals were marked as “out” in the file name) over the number
of nonmissing segments. This is done with the same
complRate function showed above.
We can see that only a few lab-based segments (N =
30 out of 291, 10.3%) were excluded due to excessive artefacts, whereas
a higher percentage of ambulatory segments (N = 243 our of 1,338, 18.7%)
were excluded for the same reasons. Overall, 280 segments (17.2% of the
nonmissing segments) were excluded due to excessive artefacts, resulting
in a total of 1,349 included segments. Here, a number
of participants (N = 14, 14.4%) showed inclusion rates < 60%
and are flagged with inclRate60 = 1.
# computing inclusion rate
irate <- complRate(HRV,what="inclusion")
##
##
## Excluded segments:
## - 280 excluded segments ( 17.2 % of nonmissing segments), from 0 to 12 , mean = 2.9 sd = 3.1 , mean perc = 17.2 % sd = 18.1 %
## - 250 from outside lab, from 0 to 10 , mean = 2.6 sd = 2.7 , mean perc = 19 % sd = 20 %
## - 30 from lab, from 0 to 2 , mean = 0.3 sd = 0.6 , mean perc = 10.3 % sd = 19.5 %
##
## Included segments:
## - 1349 excluded segments ( 82.8 % of nonmissing segments), from 6 to 18 , mean = 13.9 sd = 3.2 , mean perc = 82.8 % sd = 18.1 %
## - 1088 from outside lab, from 4 to 15 , mean = 11.2 sd = 2.9 , mean perc = 81.3 % sd = 19.6 %
## - 261 from lab, from 1 to 3 , mean = 2.7 sd = 0.6 , mean perc = 89.7 % sd = 19.5 %
# flagging participants with IR < 60% (N = 14)
irate$inclRate60 <- 0
irate[irate$tot.IR < 60,"inclRate60"] <- 1
HRV <- join(HRV,irate[,c("ID","inclRate60")],by="ID",type="left")
summary(as.factor(irate$inclRate60))
## 0 1
## 83 14
Finally, we inspect the data of 10 included participants
(10.3%) whose signal was overall classified as ‘noisy’ due to
higher uncertainty of pulse interval detection. The plotPI
function showed at the beginning of section 3.3 is used to visualize the
pulse intervals time series for these participants and to compare them
with a random subsample of 5 ‘clean-signal’ participants.
We can note that, compared to ‘clean-signal’ participants, flagged participants tended to show less variable pulse intervals. Some cases (e.g., HRV032) tend to show ‘fixed’ pulse intervals values (i.e., certain values were highly frequent), as partially due to the sampling frequency of the E4 PPG sensor and can be also seen for some unflagged participant. Other cases (e.g., HRV054) show frequent steep changes of about 10-20 bpm.
Here, we flag the 10 participants showing a ‘noisy’ signal with
noisyPI = 1. We can note that the noisyPI
variable looks quite indepentend from the previously created
art10 and inclRate60 variables.
noisy <- c("HRV018","HRV020","HRV021","HRV032","HRV033","HRV050","HRV054","HRV063","HRV070","HRV076")
HRV$noisyPI <- 0
HRV[HRV$ID %in% noisy,"noisyPI"] <- 1
summary(as.factor(HRV[!duplicated(HRV$ID),"noisyPI"]))
## 0 1
## 87 10
# crossing noisyPI cases with art10
table(HRV[,c("noisyPI","art10")])
## art10
## noisyPI 0 1
## 0 1021 199
## 1 94 35
# crossing noisyPI participants with inclRate60
table(HRV[!duplicated(HRV$ID),c("noisyPI","inclRate60")])
## inclRate60
## noisyPI 0 1
## 0 75 12
## 1 8 2
Here, we visualize the pulse intervals of all segments recorded by
participants flagged with noisyPI = 1.
plotPI(data.path="HRV/FinalProcessing",IDs=noisy)
Here, we visualize the pulse intervals of all segments recorded by a
random subsample of 5 participants not flagged with
noisyPI = 1.
plotPI(data.path="HRV/FinalProcessing",IDs=sample(levels(HRV$ID)[!(levels(HRV$ID) %in% noisy)],size=10))
Here, the raw behavioral data from Go/No-Go task
exported from E-Prime (version 2.0.10) (Psychology Software Tools, Inc.,
Sharpsburg, PA) are read and saved in the GNG dataset,
recoded, and filtered based in inclusion criteria.
First, we read the CSV data files exported from the E-Prime software.
Specifically, we used E-Merge to join the data from all the 105
participants invited to participate to the daily protocol, in a single
data file Merged_105_final.csv. Here, we read that
long-form trial-by-trial dataset.
# reading data
GNG <- read.csv2("GONOGO/Merged_105_final.csv",header=TRUE)
Then, we recode the GNG variables to be used for the
analyses.
The following packages and functions are used to optimize the ESM data recoding:
library(Rmisc)
GNGrecoding
GNGrecoding <- function(data){
# VARIABLE RECORDING
# .............................................................
# keeping only columns of interest
data <- data[c("ID","Session","Procedure.Block.","SessionDate","SessionTime","TrialType","Image","Jitter", # session info
paste("Stimulus",rep(1:6,each=2),".",rep(c("ACC","RT"),6),sep=""))] # accuracy and reaction times
# recoding block information
colnames(data)[which(names(data)=="Procedure.Block.")] <-"block"
data <- data[data$block!="TRAINING",] # excluding training data
data$block <- as.integer(substr(data$block,6,6)) # block as integer from 1 to 6
# merging performance data into single variables (RT and ACC)
data[is.na(data)] <- 0 # missing data (NA) as 0 to be summed
data$ACC <- rowSums(data[,which(substr(colnames(data),11,13)=="ACC")]) # accuracy
data$RT <- rowSums(data[,which(substr(colnames(data),11,12)=="RT")]) # reaction time
data[,paste("Stimulus",rep(1:6,each=2),".",rep(c("ACC","RT"),6),sep="")] <- NULL # removing stimulus-related columns
# letter stimulus
data$Image <- as.factor(gsub(".png","",data$Image))
colnames(data)[which(names(data)=="Image")] <-"letter"
# TrialType as factor
data$TrialType <- as.factor(data$TrialType)
# FIXING PROBLEMS WITH SINGLE PARTICIPANTS
# .............................................................
# # checks cases with No. of trials per block != 109
# for(ID in levels(as.factor(paste(data$ID,data$block)))){
# if(nrow(data[paste(data$ID,data$block)==ID,])!=108){
# print(paste(ID,nrow(data[paste(data$ID,data$block)==ID,]))) }}
# blocks 5 and 6 of HRV001 were saved as HRV003
data[data$ID=="HRV003" & data$block==5,][109:216,"ID"] <- "HRV001"
data[data$ID=="HRV003" & data$block==6,][109:216,"ID"] <- "HRV001"
# HRV031 performed block 6 before block 5, then again block 6 (invalid), then block 5 twice (second invalid)
data <- rbind(data[!(data$ID=="HRV031" & data$block==5),],
data[data$ID=="HRV031" & data$block==5,][1:108,]) # taking only 1st
data <- rbind(data[!(data$ID=="HRV031" & data$block==6),],
data[data$ID=="HRV031" & data$block==6,][1:108,]) # taking only 1st
data[data$ID=="HRV031" & data$block==5, "block"] <- 9 # inverting 5 and 6
data[data$ID=="HRV031" & data$block==6, "block"] <- 5
data[data$ID=="HRV031" & data$block==9, "block"] <- 6
# HRV042 made block 6 twice (first time invalid)
data <- rbind(data[!(data$ID=="HRV042" & data$block==6),],
data[data$ID=="HRV042" & data$block==6,][109:216,]) # taking only 2nd
# HRV060 made block 4 twice (first time invalid)
data <- rbind(data[!(data$ID=="HRV060" & data$block==4),],
data[data$ID=="HRV060" & data$block==4,][109:216,]) # taking only 2nd
# Block 1 is invalid for HRV066 (technical problem) --> marking everything as NA
data[data$ID=="HRV066" & data$block==1,c("ACC","RT")] <- NA
# HRV086 made block 5 twice (first time invalid)
data <- rbind(data[!(data$ID=="HRV086" & data$block==5),],
data[data$ID=="HRV086" & data$block==5,][109:216,]) # taking only 2nd
# FINAL RECODING AND SAVING
# .............................................................
# renaming Session as "day" and recoding block as either 1 (i.e., 1, 3, and 5) or 2 (i.e., 2, 4, and 6)
colnames(data)[which(names(data)=="Session")] <-"day"
data[data$block==3|data$block==5,"block"] <- 1
data[data$block==4|data$block==6,"block"] <- 2
return(data[order(data$ID,data$SessionDate),])
}
First, we recode participants’ identification codes
ID (currently corresponding to their e-mail addresses)
using the “HRVXXX” format (e.g., from ‘john.smith@gmail.com’ to ‘HRV001’). This is done with
the participantID_recoding_GONOGO() function. Again, since
the participants’ e-mail addresses are confidential, both the function
and the original dataset are not showed.
# loading and using the function to recode the ID variable
source("participantID_recoding_GONOGO.R")
(GNG <- participantID_recoding_GONOGO(GNG))[1:3,] # recoding GNG IDs and showing first three rows
Second, we recode the SessionDate and
SessionTime variables so that they indicate the date and
time and the time within day of each session, respectively. We can note
that all sessions were recorded within the data collection period (i.e.,
between November 2018 and November 2019). In terms of time within day,
we can note that most sessions were recorded between 16:15 and 18:15,
with only three sessions recorded before 15:30.
# SessionDate as date and time
GNG$sd <- as.POSIXct(NA)
GNG[substr(GNG$SessionDate,3,3)=="/","sd"] <- # SessionDate format as d/m/y
as.POSIXct(paste(GNG[substr(GNG$SessionDate,3,3)=="/","SessionDate"],
substr(GNG[substr(GNG$SessionDate,3,3)=="/","SessionTime"],12,19)),format="%m/%d/%Y %H:%M:%S",tz="GMT")
GNG[substr(GNG$SessionDate,3,3)=="-","sd"] <- # SessionDate format as m-d-y
as.POSIXct(paste(GNG[substr(GNG$SessionDate,3,3)=="-","SessionDate"],
substr(GNG[substr(GNG$SessionDate,3,3)=="-","SessionTime"],12,19)),format="%m-%d-%Y %H:%M:%S",tz="GMT")
GNG$SessionDate <- GNG$sd
# plotting SessionDate and SessionTime
par(mfrow=c(1,2))
hist(GNG[!duplicated(paste(GNG$ID,GNG$Session)),"SessionDate"],breaks="months",freq=TRUE,main="SessionDate")
hist(GNG[!duplicated(paste(GNG$ID,GNG$Session)),"SessionTime"],breaks="min",freq=TRUE,main="SessionTime")
Third, we use the GNGrecoding() function showed above to
recode the other variables in the GNG dataset, to filter
unuseful columns, and to correct some issues with specific participants
(see function details at the beginning of section 4.2).
(GNG <- GNGrecoding(GNG))[1:3,] # recoding GNG and showing first three rows
Fourth, we create the wide-form block-level
BLOCK dataset by averaging accuracy ACC and
reaction times RT across blocks. Indeed, each participant
performed the task once per day, for three consecutive
days, and each day the session was divided into two
108-trial blocks with identical stimuli but different NoGo
stimulus. Thus, whereas the GNG data has one row per trial
(i.e., 108 * 6 rows per participant), the BLOCK dataset
will have one row per block (i.e., 6 rows per participant). Note that
mean RT are computed by only considering “Go” trials with
ACC = 1.
# aggregating values by block (mean)
BLOCK <- cbind(summarySE(GNG[GNG$TrialType=="Go",], # accuracy for Go trials
measurevar="ACC",groupvars=c("ID","day","block"))[,c("ID","day","block","ACC")],
summarySE(GNG[GNG$TrialType=="No-go",], # accuracy for NoGo trials
measurevar="ACC",groupvars=c("ID","day","block"))[,"ACC"], # reaction times to Go trials
summarySE(GNG[GNG$TrialType=="Go" & (GNG$ACC==1 | is.na(GNG$ACC)),],
measurevar="RT",groupvars=c("ID","day","block"))[,c("RT","sd")]) # also keeping RT.sd
colnames(BLOCK)[4:ncol(BLOCK)] <- c("Go.ACC","NoGo.ACC","RT","RT.sd")
BLOCK[1:3,] # showing first three rows
Finally, we use the variables computed in the BLOCK
dataset to compute the intra-individual coefficient of
variation ICV as the ratio between the standard
deviation RT.sd and the mean reaction time RT
in each block. Again, this is done by only considering “Go” trials with
ACC = 1.
BLOCK$ICV <- BLOCK$RT.sd/BLOCK$RT # ICV = RT.sd / mean RT
Here, we clean and filter the data based on the inclusion criteria.
First, we inspect the original number of respondents and observations
in the GNG dataset. We can note that the Go/NoGo task was
performed by the whole sample of 105 participants that were actually
invited to participate to the daily protocol. We can also note that the
number of trials and blocks do not match with the protocol for two
participants (see section 4.3.3).
cat("Original No. of trial-by-trial data =",nrow(GNG),"in",nrow(BLOCK),"blocks from",nlevels(GNG$ID),"participants")
## Original No. of trial-by-trial data = 67392 in 624 blocks from 105 participants
# participants with less than 648 trials
for(ID in levels(GNG$ID)){ if(nrow(GNG[GNG$ID==ID,])!=648){ print(paste(ID,nrow(GNG[GNG$ID==ID,]),"trials"))} }
## [1] "HRV009 216 trials"
## [1] "HRV059 432 trials"
# participants with less than 6 blocks
for(ID in levels(BLOCK$ID)){ if(nrow(BLOCK[BLOCK$ID==ID,])!=6){ print(paste(ID,nrow(BLOCK[BLOCK$ID==ID,]),"blocks"))} }
## [1] "HRV009 2 blocks"
## [1] "HRV059 4 blocks"
First, we inspect the data for cases of double trials and blocks,
that is blocks with the same ID, day, and
block value or with more than 108 trials. We can see that
no double blocks or double trials are included in the
HRV dataset.
# double blocks
cat(nrow(BLOCK[duplicated(paste(BLOCK$ID,BLOCK$day,BLOCK$block)),]),"double blocks")
## 0 double blocks
# double trials
BLOCK$nTrials <- summarySE(GNG,measurevar="ACC",groupvars=c("ID","day","block"))[,"N"]
cat(nrow(BLOCK[BLOCK$nTrials > 108,]),"blocks with double trials")
## 0 blocks with double trials
Second, we apply our inclusion criteria to the task performance, that
is we consider reaction times to “Go” trials < 150 ms as
anticipations, and we recode them by setting
ACC = 0.
# detecting and filtering anticipations
cat("Recoding",nrow(GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$RT > 0 & GNG$RT < 150,]),"anticipations (",
round(100*nrow(GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$RT > 0 & GNG$RT < 150,])/
nrow(GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$ACC==1,]),2),"% ) by setting ACC = 0")
## Recoding 470 anticipations ( 0.96 % ) by setting ACC = 0
GNG[GNG$TrialType=="Go" & !is.na(GNG$ACC) & GNG$RT > 0 & GNG$RT < 150,"ACC"] <- 0
# updating BLOCK dataset
BLOCK$Go.ACC <- summarySE(GNG[GNG$TrialType=="Go" & (GNG$ACC==1 | is.na(GNG$ACC)),],
measurevar="ACC",groupvars=c("ID","day","block"))[,"ACC"]
BLOCK[,c("RT","RT.sd")] <- summarySE(GNG[GNG$TrialType=="Go" & (GNG$ACC==1 | is.na(GNG$ACC)),],
measurevar="RT",groupvars=c("ID","day","block"))[,c("RT","sd")]
BLOCK$ICV <- BLOCK$RT.sd/BLOCK$RT # ICV = RT.sd / mean RT
Third, we mark the 8 participants that were excluded for the reasons specified in section 1.3.4. As in that section, such participants are marked as ‘OUT’. Thus, from the initial 105 participants, the number of excluded participants is 8 (7.62%%), whereas the total number of included participants is 97 (92.38%).
# marking 8 excluded participants as "OUT"
memory <- GNG
GNG$ID <- as.character(GNG$ID)
excl <- c("HRV009","HRV011","HRV015","HRV027","HRV029","HRV046","HRV059","HRV066")
GNG[GNG$ID%in%excl,"ID"] <- gsub("HRV","OUT",GNG[GNG$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"
GNG$ID <- as.factor(GNG$ID)
GNG <- GNG[order(GNG$ID,GNG$SessionDate),]
# same thing with the BLOCK dataset
memory.BLOCK <- BLOCK
BLOCK$ID <- as.character(BLOCK$ID)
BLOCK[BLOCK$ID%in%excl,"ID"] <- gsub("HRV","OUT",BLOCK[BLOCK$ID%in%excl,"ID"]) # marking excluded participants as "OUTXXX"
BLOCK$ID <- as.factor(BLOCK$ID)
BLOCK <- BLOCK[order(BLOCK$ID,BLOCK$day,BLOCK$block),]
# updating number of included participants
cat(nrow(GNG[substr(GNG$ID,1,3)=="HRV",]),"included observations in",nrow(BLOCK[substr(BLOCK$ID,1,3)=="HRV",]),"blocks from",
nlevels(as.factor(as.character(GNG[substr(GNG$ID,1,3)=="HRV","ID"]))),"included participants \nout of",
nrow(memory),"total observations in",nrow(memory.BLOCK),"blocks from",nrow(memory[!duplicated(memory$ID),]),
"invited participants")
## 62856 included observations in 582 blocks from 97 included participants
## out of 67392 total observations in 624 blocks from 105 invited participants
No further criteria are considered to flag potentially confounding cases in relation to the Go/NoGo data.
Now that we read, recoded, and filtered the four datasets collected in the study, we can merge them into the final datasets to be used for the analyses. Specifically, we create four final datasets:
ema includes data from the ESM,
HRV, and PrelQS dataset, with one row
per measurement of either HRV (5 ambulatory + 1
lab) or ESM ratings (1 ‘baseline’ + 5 ambulatory), whereas
PrelQS information is identically repeated in each row of a
given participant
between includes all the collected variables, with
one row per participant and by including the
participant averages (e.g., mean, median, SD) of each variable
gng includes all the collected variables, with
one row per trial of the Go/NoGo task, only including
the values from the last daily measurement from the ESM and
the HRV datasets
BLOCK includes all the collected variables, with
one row per block of the Go/NoGo task, only including
the values from the last daily measurement from the ESM and
the HRV datasets
First, … ARRIVATO QUI
NOTE PER LE ANALISI: nel valutare la distribuzione da usare: - ex-Gaussian, quasi normale con coda positiva (positively skewed), ma mu e sigma sono indipendenti - Gamma: come sopra ma mu e sigma sono correlate -> per vederlo conviene plottare RMSSD per diverse categorie di X e vedi se la variabilità aumenta con la media dell’RMSSD - nota: però la Gamma inizia a zero!
Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4-19. https://doi.org/10.1016/j.jesp.2015.07.006
de Geus, E. J., Gianaros, P. J., Brindle, R. C., Jennings, J. R., & Berntson, G. G. (2019). Should heart rate variability be “corrected” for heart rate? Biological, quantitative, and interpretive considerations. Psychophysiology, 56(2), e13287. https://doi.org/10.1111/psyp.13287
Geeraerts, J. (2020, May 20). Investigating careless responding detection techniques in experience sampling methods. Retrieved from https://osf.io/d798j/
Kaufmann, T., Sütterlin, S., Schulz, S. M., & Vögele, C. (2011). ARTiiFACT: A tool for heart rate artifact processing and heart rate variability analysis. Behavior Research Methods, 43(4), 1161–1170. https://doi.org/10.3758/s13428-011-0107-7
Xiong, H., Huang, Y., Barnes, L. E., & Gerber, M. S. (2016). Sensus: a cross-platform, general-purpose system for mobile crowdsensing in human-subject studies. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 415–426. https://doi.org/10.1145/2971648.2971711